Dedication
March 9th of 2021 was one of the saddest days on record for me and I’m sure most of my colleagues. 


On this day Kishan Baheti was taken from us by the covid virus. 
To my dear friend Kishan, this book is dedicated to you. 
Preface


During the spring semester of 2020 I was delivering my stochastic control course, for which the 
final weeks always focus on topics in reinforcement learning (RL). Throughout the semester I was 
thinking ahead to crash courses on this topic to be delivered later in the year: two summer courses 
planned in Paris and Berlin, and another scheduled as part of the Simons Institute program on 
reinforcement learning.¹ A pandemic altered my travel plans, but also gave me time to reflect on
how to better teach this difficult material. 

Soon after the Spring semester ended I was contacted by Diana Gillooly, then an editor at 
Cambridge University Press. She wrote “...someone mentioned that you were scheduled to deliver 
lectures on reinforcement learning...”, and asked if I might be interested in writing a book on the 
topic. Her brief email set in motion the pages that lie in front of you. 

There is of course a longer history: the book is a product of handouts I’ve prepared over more 
than a decade, and bits and pieces of papers and book chapters prepared over a longer period. 

However, I promised my co-organizers of the Simons Institute RL program that I would present 
a crash course for true beginners, without heavy mathematics. I also promised myself I would write 
a book that would be accessible to senior undergraduates and graduate students, provided they 
came with sufficient motivation. These pandemic-induced contemplations led to two themes that I 
felt a need to spell out: 


(i) Within the control systems literature, there are dynamic programming techniques to ap- 
proximate the Q-function that appears in reinforcement learning. In particular, this “value 
function” is the solution to a simple convex program (an example is the “DPLP” appearing in 
eq. (3.36)). Many of the algorithms in reinforcement learning are designed to approximate the 
same function, but are based on root finding problems that are not well understood outside 
of very special cases. 

This is just one example of the need to build better bridges between control and RL. I 
cannot claim the bridge is fully built. My hope is that the book will provide leads for future 
discoveries based on insights from each discipline. 


(ii) Stochastic approximation (SA) is the most common method for analysis of recursive algo- 
rithms; this approach is commonly known as the ODE method [136, 229, 357, 301]. The 
relationship between RL and SA was recognized soon after Watkins introduced Q-learning 
[352, 169], and ODE methods for analysis of optimization algorithms have grown in sophisti- 
cation over the past decade [335, 375, 318, 198]. Related ODE methods are part of a standard 
modeling framework in statistical mechanics, genetics, epidemiology (e.g., the SIR model), 
and even voting [122, 276, 225, 24]. 

The narrative is flipped in this book: rather than treating an ODE as simply an analytical 
tool, every algorithm in this book begins with an ideal ODE that is regarded as “step 1” in 
algorithm design. I believe this provides better insight in algorithm synthesis and analysis. 

However, justification of this approach using SA is highly technical. In particular, the re- 
cent thesis and book chapter [110, 107] (which build on a similar narrative) assume significant 
background in the theory of stochastic processes. In this book we lift the veil: there is nothing 
inherently stochastic about stochastic approximation, provided you are willing to work with 
sinusoids or other deterministic probing signals instead of stochastic processes. The ODE 
methods surveyed in Chapters 4 and 5 make no reference to probability theory. 


¹ Video and slides from the 2020 fall program are now available at https://simons.berkeley.edu/programs/rl20

This is my third book, and like the others the writing came with discoveries. While working
on theme (i), my colleague Prashant Mehta and I found that Convex Q-Learning could be made 
much more practical by borrowing batch RL concepts that are currently popular. This led to the 
new work [247, 230], and new collaborations with Gergely Neu. You will find text and equations 
from these papers scattered throughout Chapters 3 and 5. 

Chapter 4 on ODE methods and quasi-stochastic approximation was to be built primarily on 
[40, 41]. Over the summer of 2020 all of this material was generalized to create a complete theory of
convergence and convergence rates for these algorithms, along with a better understanding of their 
application to both gradient free optimization and policy gradient techniques for RL [86, 87, 88]. 

Many new concepts may be found in the second part of the book, such as Zap Zero for Q- 
learning, and insights on the rate of convergence for actor-critic methods. Each chapter ends with 
a ‘Notes’ section that provides an overview of the origins of the main conclusions. 


Many newcomers to reinforcement learning may be disappointed to see that the theory and algo- 
rithms in this book are far removed from the dream portrayed in the popular media: reinforcement 
learning is often described as an “agent” interacting in a physical environment, and maturing as it 
gains experience. Unfortunately, given today’s technologies, the process of “learning from scratch 
while you control” is unlikely to succeed outside of very special cases such as online advertising. 

The tone of this book is entirely different: we pose an optimal control problem, and show how 
to obtain an approximate solution based on design of exploration strategies and tuning rules. This 
is not an eccentricity of the author, but a disciplined and accepted approach to derive all of the 
standard approaches to reinforcement learning. In particular, the Q-learning algorithm of Watkins 
and its extensions are designed to solve or approximate the “dynamic programming” equations 
introduced in the 1950s. 

The field is young, and its future may look something like the dream you had in mind before you 
read this preface. I hope that in the near future we will discover new paradigms for RL, perhaps 
drawing inspiration from intelligent living beings rather than optimality equations from the past 
century. I have confidence that the fundamental principles in this book will remain valuable without 
the shackles of the optimal control paradigm! 


Acknowledgements Let’s begin three decades back: In the mid 1990s I (figuratively) won the 
lottery: a Fulbright fellowship that took me and my family to Bangalore, India, including my 
young daughters Sydney and Sophie. Those nine months at IISc with Vivek Borkar were the start 
of fruitful collaborations and a long lasting friendship. Vivek’s presence can be felt on nearly every 
page of the second half of this book. 

I was also fortunate to have interactions with Ben Van Roy while he was completing his disser- 
tation research at MIT. His work with John Tsitsiklis is an absolute tour-de-force. Many aspects 
of the book draw from this early RL research. His current research is likely to have similar long 
lasting impact. 

Prashant Mehta once said to me, “I know how you do it! You surround yourself with amazing
people!” Amazing is right, and he is right there at the top. This book is a product of collaborations
with Vivek, Prashant and many others including Ana Bušić, Ken Duffy, Peter Glynn, Ioannis
Kontoyiannis, Eric Moulines, and many old friends at UTRC including Amit Surana and George 
Mathew. My PhD advisor Peter Caines was my first, and one of my all-time best colleagues, who 
enthusiastically supported my first investigations into Markov chain theory. This set the stage for 
collaborations with Richard Tweedie that began during my postdoctoral stay at the Australian
National University. All amazing people, anyone will agree.

Pre-publication draft -- March 25, 2022

Younger stars that influenced my research include Shuhang Chen, who recently defended his 
dissertation in the math department and is lead author of [59] on finer ODE methods. Many thanks 
to current graduate student Fan Lu for comments on early drafts, and assistance with numerical 
experiments. 

Prabir Barooah helped to draw me to the University of Florida, from my former home at the 
University of Illinois. I’ve benefited from our interactions, and with his students including Naren 
Raman and Austin Coffman. 

Max Raginsky helped me to navigate the literature outside of my usual orbit. In particular, 
his advice along with Polyak’s recent survey [136] helped me to understand the contributions from 
the USSR in the early days of RL and SA. Max’s research is also an inspiration: while references 
to his work are scattered throughout the book, much of this material is suited for a more advanced 
monograph. 

Much of Chapters 2 and 3 is based on the state space control course created at the Decision and 
Control Laboratory at the University of Illinois. Many thanks to Bill Perkins, Tamer Basar, and 
Max Raginsky for allowing me to borrow material from the course manuscript [29], and laboratory 
manager Daniel Block for leading the design of innovative control experiments. 

In 2018 I was fortunate to spend several months at NREL, where I conducted research at 
the Autonomous Energy Systems laboratory. One outcome of these interactions was research on 
stochastic approximation, leading to the articles [40, 41, 93, 86, 87, 88]. The book would not be 
the same without collaborations at NREL with Andrey Bernstein, Marcello Colombino, Emiliano 
Dall’Anese, and my former graduate student Yue Chen. 

In reviewing the literature on extremum seeking control for Chapter 4, I was skeptical of the 
common claim in the research literature that the idea began in the 1920s. The most convincing 
case for this history was made in [348]. I contacted co-author Iven Mareels who reassured me that 
this history was accurate. Then, with help from colleagues in France, I found and translated the 
1922 document [217] that is considered the source of this optimization technique. 

One of the great ‘bridge builders’ at the intersection of RL and control theory is Frank Lewis, 
who has led the creation of several collected volumes on these topics. I was surprised when he 
thought of me one decade ago, leading to the contribution [165], and very enthusiastic when he 
invited me to contribute to a new volume one decade later [110]. 

Until recently I have regarded RL as a hobby, as motivation for simplified models for complex 
systems (such as networks [254]), and as a vehicle to teach control theory. This changed with the 
arrival of Adithya Devraj to the University of Florida, where he pursued graduate studies with me 
until he graduated and departed for Stanford in the Spring of 2020. His curiosity and intellect were 
an inspiration in many ways, and in particular drove me to learn more about the evolution of RL 
over the past decade. Many of the figures and much of the theory in the second part of this book 
are taken from his dissertation [107]. He also provided suggestions that improved many parts of 
the book. 

I owe the Simons Institute a great debt. During the Spring of 2018 I was a long term visitor 
during their program on Real-time decision making, and was doubly fortunate to be joined by Ana 
Bušić and Adithya Devraj. We learned a great deal from fellow visitors, and Peter Bartlett (along
with other locals). Our discussions back then helped to motivate the 2020 program on RL which 
provided a massive crash course on every aspect of the subject, with a strong emphasis on the sort 
of bridge building I am attempting to pursue with this book. In the fall of 2020 I watched tutorials 
on recent actor-critic techniques just before finishing Chapter 10 on this very topic. The book 
benefited from the rltheory virtual seminar series, organized by Gergely Neu, Ciara Pike-Burke,
and Csaba Szepesvári,² which was also inspired by the 2020 RL program at Simons.

Returning to the present: In the spring of 2021 I created a new course based on part 1 of 
this book. Many of the students were hungry for an accessible treatment of both control systems 
and RL, and survived the sometimes difficult and rocky three months. I appreciate all of the 
feedback I received over the semester, and did my best to respond. You can thank Arielle Stevens 
for correcting many confusing passages in the first three chapters, and for the gray boxes used to 
highlight important concepts. Many more improvements were made in response to input from other 
students, including Caleb Bowyer, Bo Chen, Austin Coffman, Chetan Dhulipalla, Weihan Shen, 
Zetong Xuan, Kei-Tai Yu, and Yongxu Zhang. Also on this list is the recent graduate Dr. Bob 
Moye and current graduate students conducting research with me on RL and related topics: Mario 
Baquedano Aguilar, Caio Lauand and Amin Moradi. 

I also received substantial feedback from current PhD candidate Vektor Dewanto soon after a 
draft manuscript was posted on Twitter in August of 2021. 

Of course, I cannot forget my sponsors. Bob Bonneau at AFOSR encouraged funding for early 
research with Prashant Mehta on Q-learning, mean-field games, and nonlinear filtering. Derya 
Cansever and Purush Iyer at ARO have funded more recent research on related topics. The National 
Science Foundation has funded my most abstract and seemingly worthless research topics, which 
hopefully led to something of value. My most reliable ally at NSF was Radhakisan (Kishan) Baheti, 
who recommended funding for my very first grant (on the topic of adaptive control, at the start of 
the 90s). He was a fantastic mentor, alert to potentially foolish ideas, and also inspired by new if 
potentially useless research directions. He knew what everyone in the control community was up 
to! He also inspired all of us with his marathon runs and mastery of yoga. 


— Sean Meyn, August 1, 2021 


² https://sites.google.com/view/rltheoryseminars/home
Chapter 1


Introduction 


To define reinforcement learning (RL) it is first necessary to define automatic control. Examples in 
your everyday life may include the cruise control in your car, the thermostat in your air-conditioning, 
refrigerator and water heater, and the decision making rules in a modern clothes dryer. There are 
sensors that gather data, a computer to take the data to understand the state of the “world” (is the 
car traveling at the right speed? Are the towels still damp?), and based on these measurements an 
algorithm powered by the computer spits out commands to adjust whatever needs to be adjusted: 
throttle, fan speed, heating coil current, or .... More exciting examples include space rockets, 
artificial organs, and microscopic robots to perform surgery. 

The dream of RL is automatic control that is truly automatic; without any knowledge of physics 
or biology or medicine, an RL algorithm tunes itself to become a super controller: the smoothest 
ride into space, and the most expert micro-surgeon! 

The dream is surely beyond reach in most applications, but recent success stories have inspired 
industry, scientists, and a new generation of students. 

DeepMind’s AlphaGo set the world on fire following the defeat of Go champion Fan Hui in 2015, 
and Lee Sedol the following year (story told in the 2017 film [185]). Soon after was the astonishing 
sequel AlphaZero, which learns to play Chess and Go by “self play” without any help from experts 
(£50; 322, S00).³


1.1 What You Can Find in Here 


Reinforcement learning today rests on two pillars of equal importance: 


1. Optimal control: the two most famous RL algorithms, TD- and Q-learning, are all about 
approximating the value function that is at the heart of optimal control. Similarly, actor-critic 
methods are based on state-feedback, which is motivated by optimal control theory. 


2. Statistics and information theory, especially the topic of exploration, as in bandit theory. 
Consider the annoying ads on YouTube, which provide one example of Google’s exploration:
“will Diana click???” [307, 171, 216]. Exploration in RL is a rapidly evolving field, and it will
surely generate many new books in the years to come.


³ These success stories follow a longer history of demonstrations based on board games. An earlier breakthrough is
Tesauro’s implementation of TD-learning with neural network function approximation to obtain an RL algorithm for
Backgammon [350]. The 1997 victory of IBM’s “Deep Blue” over chess champion Garry Kasparov was also front-page
news [373]. While this algorithm resembles model predictive control rather than any RL algorithm described in this
book, we will be liberal in our taxonomy of control and RL since we want to benefit from the best of both.
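To make the first pillar concrete, here is a minimal sketch of the dynamic programming equation for a Q-function in a purely deterministic setting. The five-state shortest-path problem, the names `step`, `cost` and `q_value_iteration`, and the iteration count are all hypothetical choices made for illustration; they are not taken from the book. The loop is plain value iteration, which TD- and Q-learning are designed to approximate when the model is unknown.

```python
# Hypothetical deterministic shortest-path problem (not from the book):
# states 0..4 on a line, state 4 is the goal, and each move costs 1.
STATES = range(5)
ACTIONS = (-1, +1)          # move left or move right
GOAL = 4

def step(x, u):
    """Deterministic dynamics x+ = F(x, u), clipped to the state space."""
    if x == GOAL:
        return x            # the goal is absorbing
    return min(max(x + u, 0), GOAL)

def cost(x, u):
    """Unit cost per move, zero cost once the goal is reached."""
    return 0.0 if x == GOAL else 1.0

def q_value_iteration(n_iter=50):
    """Iterate Q(x, u) <- c(x, u) + min over u' of Q(F(x, u), u')."""
    Q = {(x, u): 0.0 for x in STATES for u in ACTIONS}
    for _ in range(n_iter):
        Q = {(x, u): cost(x, u) + min(Q[(step(x, u), v)] for v in ACTIONS)
             for x in STATES for u in ACTIONS}
    return Q

Q = q_value_iteration()
# The value function is the minimum of Q over actions; the policy is the minimizer.
value = {x: min(Q[(x, u)] for u in ACTIONS) for x in STATES}
policy = {x: min(ACTIONS, key=lambda u: Q[(x, u)]) for x in STATES}
print(value)
print(policy)
```

In this toy problem the value at each state is simply the number of moves needed to reach the goal, and the minimizing action away from the goal is "move right".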


The big focus of the book is the first pillar, emphasizing the geometry of optimal control, and 
why it should not be difficult to create reliable algorithms for learning. We will not ignore the 
second pillar: motivation and successful heuristics will be explained without diving deeply into the 
theory. The reader will learn enough to begin experimenting with homemade computer code, and 
have a big library of options for algorithm design. Before completing half of the book, I hope that 
a student will have a solid understanding of why these algorithms are expected to be useful, and 
why they sometimes fail. 

This is only possible through mastery of several fundamentals: 


(i) The philosophical foundations of control design 

(ii) The theory of optimal control 
(iii) ODEs: stability and convergence. The ODE method, including translation to algorithm. 
(iv) Basics of machine learning (ML): function approximation & optimization theory. 


Any reader who knows the author will wonder why the list is so short! The following topics are 
seen as fundamental in RL theory, and are fundamental to much of my research: 


(i) Stochastic processes and Markov chains 
(ii) Markov decision theory 
(iii) Stochastic approximation and convergence of algorithms 


Yes, we will get to all of this. However, I want to make it clear that there is no need for probability 
theory to understand many of the important concepts in RL. 

The first half of the book contains RL theory and design techniques without any probability 
pre-requisites. This means we pretend that the world is purely deterministic until the surname 
“Markov” appears in the second half of the book. Justification comes in part from the control 
foundations covered in Chapters 2 and 3. Do you think we modeled the probability of hitting a 
seagull when we designed a control system to take astronauts to the moon? The models used in 
traditional control design are often very simple, but good enough to get insight on how to control 
a rocket or a pancreas. 

Beyond this, once you understand RL techniques in this simplified setting, it does not take 
much work to extend the ideas to the more complex probabilistic realm. 

Among the highlights of Part I of the book are 


> ODE design. The ODE (ordinary differential equation) method has been a workhorse for 
algorithm analysis since the introduction of the stochastic approximation technique of Robbins and 
Monro in the early 1950s [301]. In this book we flip the narrative, and start off in the continuous 
time domain. There is tremendous motivation for this point of view: 


(i) We will see that an ODE is much simpler to describe and analyze than the discrete-time 
noisy algorithm that is eventually implemented. 


(ii) Remarkable discoveries in the optimization literature reinforce the value of this approach: 
first design an ODE with desirable properties, and then find a numerical analyst to implement 
this on a computer [318]. It is now known that the famous acceleration techniques of Polyak 
and Nesterov can be interpreted in this way. 
(iii) Zap Q-learning will be one of many algorithms described in the book. It is a particular ODE
design based on the Newton-Raphson flow introduced in the economics literature, and first 
analyzed by Smale in the 1970s [325]. The Zap ODE is universally stable and consistent, and 
hence when translated with care, provides new techniques for RL design [89, 112, 110, 107]. 
The power of Zap design is illustrated in Fig. 1.1—see Section 9.8.2 for details. 


[Plot omitted. Vertical axis: Bellman Error. Legend entries include Watkins and Speedy Q-learning.]

Figure 1.1: Maximum Bellman error {B_n : n > 0} for various Q-learning algorithms [107].
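The "design an ODE first, then discretize" recipe can be sketched on a toy root-finding problem. The Newton-Raphson flow for solving f(θ) = 0 is the ODE dθ/dt = -f(θ)/f'(θ) in the scalar case, so that f(θ_t) = e^{-t} f(θ_0) along the solution. The function `f`, the initial condition, the horizon `T` and the Euler step `h` below are assumptions chosen for illustration; this is not the Zap Q-learning algorithm itself, only the flow it is built on.

```python
import math

def f(theta):
    """Hypothetical scalar root-finding target; the root is theta = 1."""
    return theta**3 + theta - 2.0

def df(theta):
    """Derivative of f; strictly positive, so the flow is well defined."""
    return 3.0 * theta**2 + 1.0

def newton_raphson_flow(theta0, T=10.0, h=0.01):
    """Euler approximation of d/dt theta = -f(theta) / f'(theta)."""
    theta = theta0
    for _ in range(int(round(T / h))):
        theta -= h * f(theta) / df(theta)
    return theta

theta0, T = 3.0, 10.0
theta_T = newton_raphson_flow(theta0, T)
# Along the exact ODE solution, f(theta_t) = exp(-t) * f(theta_0).
ratio = f(theta_T) / f(theta0)
print(theta_T, ratio, math.exp(-T))
```

The Euler step turns the flow into a damped Newton iteration; the printed ratio should be close to exp(-T), with the gap reflecting discretization error.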


> Quasi-stochastic approximation. The theory of “stochastic approximation” amounts to justifying 
a discrete time algorithm based on an ODE approximation. An understanding of the general theory 
requires substantial mathematical background. 

The reader will be introduced to stochastic approximation, without any need to know the 
meaning of “stochastic”. This is made possible by substituting mixtures of sinusoids for randomness, 
which is justified by an emerging science [228, 11, 52, 51, 93, 88]. Not only is this more accessible, 
but the performance in application to policy gradient methods in RL is remarkable. 
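A toy comparison hints at why deterministic probing can help. The sketch below runs the same averaging recursion twice: once driven by a sinusoid, once by i.i.d. Gaussian noise of equal power. The recursion, the target value, the run length and the names `run_recursion` and `random_error` are assumptions for illustration only; this is not the Mountain Car experiment from Chapter 4. With gain 1/(n+1) the estimate equals the target plus the running average of the probing signal, so the sinusoid gives an error of order 1/N while the random signal gives order 1/sqrt(N).

```python
import math
import random

def run_recursion(probe, target=1.0, n_steps=10_000):
    """theta_{n+1} = theta_n + a_n (target + xi_{n+1} - theta_n), a_n = 1/(n+1)."""
    theta = 0.0
    for n in range(n_steps):
        xi = probe(n + 1)
        theta += (target + xi - theta) / (n + 1)
    return theta

TARGET = 1.0

# Deterministic probing: a single sinusoid sampled at integer times.
qsa_error = abs(run_recursion(lambda n: math.sin(n)) - TARGET)

# Randomized probing: i.i.d. Gaussian noise with the same power (variance 1/2).
def random_error(seed):
    rng = random.Random(seed)
    sigma = math.sqrt(0.5)
    return run_recursion(lambda n: rng.gauss(0.0, sigma)) - TARGET

errors = [random_error(seed) for seed in range(20)]
random_rms = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"sinusoidal probing error: {qsa_error:.2e}")
print(f"random probing RMS error: {random_rms:.2e}")
```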

The plots shown in Fig. 1.2 are based on experiments described in Chapter 4, comparing explo- 
ration using sinusoids vs. traditional random “i.i.d.” exploration. The histograms were generated 
through 1000 independent experiments. The traditional approach labeled “1SPSA” requires
many orders of magnitude more training than QSA.
Figure 1.2: Error analysis for two policy gradient algorithms for Mountain Car, using QSA and traditional random-
ized exploration. 


> Batch RL methods and convex Q-learning. One of the founders of AlphaGo admits that extension 
of these techniques is not trivial: “This approach won’t work in more ill-structured problems like 
natural-language understanding or robotics, where the state space is more complex and there isn’t 
a clear objective function” [150]. 
In applications to controlling building systems, the energy grid, robotic surgery, or autonomous
vehicles, we need to think carefully about more structured learning and control architectures, 
designed so that we have a reliable outcome (hopefully with some guarantees on performance as 
well as the probability of disaster). The basic RL engine of AlphaZero is Deep Q-Learning (DQN) 
[261, 260, 259]: a batch Q-learning method that is easily explained, and offers great flexibility to 
allow the inclusion of “insights from experts”. Completely new in this book are convex-analytic 
approaches to RL that have a far stronger supporting theory, and can be designed to impose bounds 
on performance. 


Part II might be considered a more traditional treatment of RL, since it begins with Markovian 
models and surveys the original TD methods developed in the 1990s. What is unique in this book 
is the attention to algorithm design, including ways to obtain insight in selecting ‘meta-parameters’ 
such as step-size gains. There is also new material, such as variance reduction methods (also known 
as acceleration techniques) for both Q-learning and actor-critic methods. 


1.2 What’s Missing? 


The focus of the book is on control fundamentals that are most relevant to reinforcement learning, 
and a large collection of tools for RL algorithm design based on these fundamentals. 
Important topics that will receive less attention: 


(i) Exploration. This is a hot topic for research right now [216, 307, 171, 281, 245]. The theory is 
mature mainly within the sub-discipline of bandit theory. This book contains only a mention 
of bandit basics to explain the exploration/exploitation tradeoff in RL. 


(ii) Data science. Here I refer to the empirical side of statistics and computer engineering. Much 
success in RL may have been initially inspired by hardcore mathematical analysis, but success 
in applications requires clever computer engineers to create efficient code, and patience to test 
algorithms and hopefully improve them based on insight gained in particular examples. 

There are no large-scale empirical studies described in this book. Also, there is no attempt 
to catalog the growing list of RL algorithms. 


(iii) Machine learning topics. The book will explain the meaning of neural networks and kernels, 
but ask the reader to go elsewhere for details. 


Sample complexity theory is covered only briefly. There is no question that sample complexity 
theory is the bedrock of statistical learning theory, and the theory of bandits. However, I personally 
believe that its value in RL is limited: bounds are typically loose, and to date they offer little insight 
for algorithm design. For example, I don’t see how today’s finite-n bounds will help DeepMind 
create better algorithms for Go or Chess. 

The value of asymptotic statistics is under-appreciated in the RL literature. The best way to 
make this point is through images. Fig. 1.3 is taken from [114] (many similar plots can be found in 
the thesis of A. Devraj [107]). Technical details regarding these plots may be found in Section 9.7. 

The histograms show estimation error for a single parameter (one of many) using a particular 
implementation of tabular Q-learning. The integer N refers to the run-length of the algorithm, and 
the histograms were obtained from 1,000 independent experiments. The “theoretical density” is 
what you can obtain from asymptotic statistics theory for stochastic approximation. This density 
is easily estimated based on limited data. In particular, in this experiment, it is clear after N = 10^4
samples that we have a very good variance estimate. This can then be used to obtain approximate 
Figure 1.3: Comparison of Q-learning and Relative Q-learning algorithms for the stochastic shortest path problem
of [112]. The relative Q-learning algorithm is unaffected by large discounting. 


confidence bounds after we run to N = 10’ (we know how far we have to run once we have a 
variance estimate). 

Variance theory for SA is used to decide run-lengths. It is also used for algorithm design: 
for example, adjust variables in an algorithm so that the asymptotic variance is minimized. Zap 
stochastic approximation and Polyak-Ruppert averaging are two examples of this approach. Outside 
of the bandits literature, I am not aware of work in the sample complexity literature that can offer
similar value for algorithm design in reinforcement learning. 
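The use of a variance estimate to choose a run-length can be sketched as follows. The example is a deliberately simple stochastic approximation recursion (a running mean of Uniform(0, 1) samples), not the tabular Q-learning experiment behind Fig. 1.3; the pilot length, number of runs, tolerance and function names are all hypothetical. It assumes a Central Limit Theorem of the form sqrt(N)(θ_N - θ*) ≈ N(0, σ²), estimates σ² from short pilot runs, and then solves for the N that gives a desired approximate 95% confidence half-width.

```python
import math
import random

def sa_estimate(rng, n_steps):
    """Stochastic approximation for a mean: theta += (X - theta) / n."""
    theta = 0.0
    for n in range(1, n_steps + 1):
        x = rng.random()                 # hypothetical data: Uniform(0, 1)
        theta += (x - theta) / n
    return theta

def asymptotic_variance(n_pilot=1_000, n_runs=200, seed=0):
    """Estimate sigma^2 in the CLT sqrt(N) (theta_N - theta*) ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    finals = [sa_estimate(rng, n_pilot) for _ in range(n_runs)]
    mean = sum(finals) / n_runs
    sample_var = sum((t - mean) ** 2 for t in finals) / (n_runs - 1)
    return n_pilot * sample_var

def required_run_length(sigma2, half_width, z=1.96):
    """Smallest N whose approximate 95% confidence half-width meets the target."""
    return math.ceil(sigma2 * (z / half_width) ** 2)

sigma2 = asymptotic_variance()
n_needed = required_run_length(sigma2, half_width=1e-3)
print(sigma2, n_needed)
```

For this toy recursion the true asymptotic variance is the variance of the samples, 1/12, which gives a check on the pilot estimate.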


1.3 Resources


Many of the examples introduced in Chapters 2 and 3 are adapted from Lecture Notes on Control 
System Theory and Design [29]. The early chapters provide useful linear algebra background, but 
far more control theory than is needed to follow this book. 

There are many great textbooks on deterministic control systems. See [15] and Murray’s two- 
part online survey [268] for basics of modeling and design, and much more on linear control systems 
in [205, 7, 77]. Bertsekas [46, 45] provides a great treatment of optimal control for nonlinear systems, 
and an introduction to RL with a perspective that is similar to this book. 

Luenberger [231] is my favorite introduction to optimization, and Boyd and Vandenberghe [74] 
is considered the bible, with far more depth in the domain of numerical methods. See also the new 
monograph of Bach [19]. 

The book of Sutton and Barto [338] is an encyclopedic introduction to RL. The books [47, 
44, 347] are more foundational, and [289] is complementary to the treatment of RL here. The 
monographs [221, 360] contain essays by researchers at the intersection of RL and control. Finally, 
don’t forget the resources at the Simons Institute website, simons.berkeley.edu. Videos and slides 
from the 2020 fall program on RL will be available whenever you have time to view a tutorial. 
Chapter 2


Control Crash Course 


A single chapter can hardly do justice to the amazing universe of control theory and practice. The 
textbook [15] gives an accessible introduction to the philosophy and practice of control, and is also 
full of history. I was fortunate to be at the Simons Institute at Berkeley, California, when one of 
the authors presented a two-part survey of ideas in the book.⁴ These lectures are a great starting
point if you are new to control systems, and will inspire many old-timers. 


2.1 You Have a Control Problem 


You surely have encountered control problems in your daily life. If you know how to drive, then 
you know what it is like to be part of a control system: 


y The observations (also called the “output”) refer to the data you process in order to
effectively maneuver the car through traffic: this includes your view of the streets and lights, and
the sounds of angry drivers pleading with you to adjust your speed.


u You apply inputs to the system: steering wheel, brakes and gas pedal are continuously
adjusted based on your observations. 


ϕ This symbol will be used to denote an algorithm that takes in the observations y and produces
the response u. This mapping from y to u is known as a policy, and sometimes called a feedback
law (the Greek letter is pronounced “fee”).⁵


ff You are not simply reacting to horns and lights and the lines on the road. You started off 
with a plan: get to the farmers market by 9am, while avoiding the traffic downtown due to 
the demonstration. This planning is an example of feedforward control. Planning is based on 
forecasts, so inevitably plans will change as you gather information en route: traffic updates, 
or an invitation from a close friend to park your car and join the protest. 


The feedforward component is typically defined with attention to a reference signal r. The 
primary control objective is the tracking problem: construct a policy so that some object of interest 
z(k) is approximately equal to r(k) for all k > 0 (in control courses, it is often assumed that z = y). 

⁴ https://simons.berkeley.edu/talks/murray-control-1
⁵ My apologies to those accustomed to the symbol π. This is reserved for the irrational number.

The yelling and bumps on the road are collectively known as disturbances. Along with the
reference signal, partial measurements of disturbances and their forecasts are taken into account
in both the feedforward and feedback components of the control system. The final input is often
defined as the sum of two components:


u(k) = u_ff(k) + u_fb(k) (2.1)


where in the shopping problem, u_ff quantifies the results of planning before heading to the
market (perhaps with updates every 20 minutes), and u_fb is the second-by-second operation of the
automobile.
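The decomposition in (2.1) can be illustrated with a small simulation. Everything below is an assumption made for illustration: a scalar plant x(k+1) = a x(k) + b u(k) + d, a constant reference r, a feedforward input chosen so that x = r is an equilibrium when there is no disturbance, and a proportional feedback gain K. The names `simulate`, `u_ff` and `u_fb` are hypothetical and the sketch is not a design method from the book.

```python
def simulate(K, d, a=0.9, b=0.5, r=1.0, n_steps=200):
    """Scalar plant x(k+1) = a x(k) + b u(k) + d with u = u_ff + u_fb."""
    x = 0.0
    for _ in range(n_steps):
        u_ff = (1.0 - a) * r / b      # planned input: holds x = r when d = 0
        u_fb = -K * (x - r)           # reactive correction based on the observation
        x = a * x + b * (u_ff + u_fb) + d
    return x - r                      # final tracking error

print("feedforward only, no disturbance:", simulate(K=0.0, d=0.0))
print("feedforward only, disturbance:   ", simulate(K=0.0, d=0.1))
print("feedforward plus feedback:       ", simulate(K=1.0, d=0.1))
```

In this sketch the plan alone tracks the reference only when the disturbance is absent. With the disturbance present, the tracking error obeys e(k+1) = (a - bK) e(k) + d, so the feedback term shrinks the steady-state error without removing it.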

The dream of RL is to mimic and surpass the skill with which humans create an internal algorithm
ϕ, and use it to skillfully navigate through complex and unpredictable environments.
Figure 2.1: Control systems contain purely reactive feedback, as well as planning that is regularly updated. This
represents two layers of feedback, differentiated in part by speed of response to new observations. These observations
are often limited, so that we require estimates x̂ of a partially “hidden” state process x.


Fig. 2.1 shows a block diagram typically used in model-based control design, and illustrates a few 
common design choices: there is a state to be estimated using an observer, with state estimates 
denoted $\hat{x}$. The block denoted trajectory generation constructs two signals: the feedforward 
component of the control, and also a reference $x_{ref}$ that an internal state should track (the state is 
associated with the physical process). It is designed so that $x(k) = x_{ref}(k)$ for all k implies that 
the tracking problem is solved. The state feedback is designed to achieve this ideal. 

There is a larger “world state” labeled environment for which partial measurements are 
available, and forecasts of future events. Forecasts are of course important in the planning process that 
is part of trajectory generation. 

Design of the three grey blocks is based on models of the process, the measurement (or sensor) 
noise w, disturbances (such as the “input disturbance” d indicated in the figure), and a model of the 
environment. The “Δ-feedback loop” is a standard way to represent model uncertainty associated 
with the process to be controlled. This feature may be motivated by an unfortunate story. 


Failure of an adaptive control system Beginning in the 1950s, control theorists in partnership 
with the U.S. Air Force looked for model free approaches to flight control. From this came the “MIT 
rule”, which may be regarded as an early attempt at adaptive control or “actor-only” reinforcement 
learning. Analysis of the MIT rule in [241] is based on techniques similar to the ODE method that 
is a foundation of this book. See [280] for a more recent study. 

Preliminary simulations showed promise, as did field tests on the X-15 airplane. Some quotes 
from the 1970 report [300] hint at the enthusiasm of scientists and pilots involved: 1) Nearly 
invariant response was provided at essentially all aerodynamic flight conditions 2) accurate a priori 
knowledge of aircraft aerodynamic characteristics was not required to design a satisfactory system 
3) aircraft configuration changes were adequately compensated for 4) the dual redundant concept 
provided a reliable and fail-safe system. It was also noted that the adaptive control system 
inspired confidence and allowed the pilot to spend time cross-checking flight instruments, checking 
subsystems, and “sight-seeing”. 

These observations followed 65 test flights. 

The control system was not robust enough to provide stable control in all situations encountered. 
Sadly, a pilot died in a crash attributed to oscillations induced by the adaptive system. The research 
program was shut down, but the tragedy inspired greater attention to robustness in control design. 


It should go without saying: every control engineer or practicing economist must 
study failures. Airplanes and economies inevitably crash. In the long run, it is a greater 
tragedy if the experts do not bother to learn from disaster. 


2.2. What To Do About It? 


The vast literature on control solutions is built upon a model of input-output behavior that is used 
to design the policy $\phi$. Modeling and control design are each an art form, with many possible 
solutions from vast statistics and control tool chests. 

When we say model, we mean a sequence of mappings from inputs to outputs: 


$y(k) = G_k(u(0), u(1), u(2), \dots, u(k)), \quad k \ge 0$    (2.2) 


Each of the functions $G_k$ may also depend on exogenous variables (outside of our control), such 
as the weather and traffic conditions. And here we come to one of the most vital principles of 
control design: the model must capture essential properties of the system to be controlled, and 
simultaneously be simple enough to be useful. 

For example, aerospace engineers will create absurdly simple models for the design of flight 
control systems, and from this create a policy designed to work well for the model. Of course, 
they do not stop there. The next step is to create an entirely new model for validation, and simulate 
under a range of scenarios in order to answer a range of questions: what happens when the plane 
is full, empty, or flying through a thunderstorm? How does the control system perform after an 
engine detaches from a wing? If one of these tests fails, then the control engineer goes back to 
either improve the model, improve the policy, or improve the airplane. That’s right: we may require 
additional sensors to measure pitch angles, or more powerful motors to control ailerons, flaps or 
elevators. 

I am writing without any knowledge of aerospace engineering. I am describing general principles 
for anyone interested in control design: 


1. Create a model for control. 
2. Design the policy $\phi$ based on the model. 
3. Simulate based on a high-fidelity model, and then revisit steps 1 and 2. 


The success of this approach has been tremendous, as seen in the history recounted in [15]. 
Linear and time invariant model. The most successful class of absurdly simple models 
are linear and time invariant (LTI). The general scalar LTI model is defined by a sequence of 
scalars $\{b_i\}$ (the impulse response), and for a given scalar input sequence $u$, the model defines 
y(k) as the sum 


$y(k) = \sum_{i=0}^{k} b_i u(k-i), \quad k \ge 0$    (2.3) 


This is in fact too complex in many situations. A more tractable sub-class of LTI models is 
auto-regressive moving-average (ARMA): for scalar coefficients $\{a_i, b_i\}$, 


$y(k) = -\sum_{i=1}^{N} a_i y(k-i) + \sum_{i=0}^{M} b_i u(k-i), \quad k \ge 0$    (2.4) 


A linear input-output model motivates the design of a policy $\phi$ that has a similar linear form. 
A common design technique based on optimization will be described in the following chapters, 
beginning in Section 3.6. 
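
To make the ARMA recursion (2.4) concrete, here is a minimal Python sketch that simulates it directly. The function name `simulate_arma`, the coefficient values, and the convention that all signals are zero before time 0 are assumptions made for illustration; they are not taken from the text.

```python
def simulate_arma(a, b, u):
    """Simulate y(k) = -sum_{i=1}^{N} a_i y(k-i) + sum_{i=0}^{M} b_i u(k-i).

    a = [a_1, ..., a_N] and b = [b_0, ..., b_M] (hypothetical argument layout).
    Inputs and outputs are taken to be zero for negative time (an assumption).
    """
    y = []
    for k in range(len(u)):
        # Auto-regressive part: uses past outputs only.
        ar = sum(a[i - 1] * y[k - i] for i in range(1, len(a) + 1) if k - i >= 0)
        # Moving-average part: uses current and past inputs.
        ma = sum(b[i] * u[k - i] for i in range(len(b)) if k - i >= 0)
        y.append(-ar + ma)
    return y


# Unit impulse input: the output is then the impulse response {b_i} of (2.3).
u = [1.0] + [0.0] * 5
y = simulate_arma(a=[-0.5], b=[1.0], u=u)
print(y)
```

With these illustrative coefficients the recursion is y(k) = 0.5 y(k-1) + u(k), so the impulse response halves at each step. A single auto-regressive coefficient therefore encodes an infinite impulse response, which is one sense in which (2.4) is more tractable than (2.3).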


2.3. State Space Models 


2.3.1 Sufficient statistic and the nonlinear state space model 


In statistics, the term sufficient statistic is used to denote a quantity that summarizes all past 
observations. The state plays an analogous role in control theory. 

A state space model requires the following ingredients: the state space X on which the state x 
evolves, and an input space (or action space) denoted U on which the input u evolves. We may 
have additional constraints coupling the state and the input, which is modeled via 


$u(k) \in U(x)$, when $x(k) = x \in X$    (2.5) 


with $U(x) \subset U$ for each $x$. We might also want to model an observation process y evolving on 
a set Y. In the control theory literature it is common to assume that X, U, and Y are subsets 
of Euclidean space, while in operations research and reinforcement learning it is more common to 
assume these are finite sets. Whenever possible, in this book we prefer the control perspective so 
that we can more easily search for structure of control solutions: for example, is an optimal input 
a continuous function of the state? 

Next we require two functions $F: X \times U \to X$ and $G: X \times U \to Y$ that define the state equations: 


$x(k+1) = F(x(k), u(k)), \quad x(0) = x_0$    (2.6a) 

$y(k) = G(x(k), u(k))$    (2.6b) 

An LTI model can often be transformed into a state space model in which the two functions F, G 
are linear in $(x, u)$. 


We might also allow F,G to depend upon the time variable k. It is argued in Section 3.3 that 
it is often more convenient to simply assume that the state x(k) includes k as one component. 
However, there is one example of a time-dependent model that highlights the role of state as a 
sufficient statistic: The general input-output model (2.2) always has a state space description, in 
which the state is the full history of inputs: 


$x(k+1) = [u(0), u(1), u(2), \dots, u(k)]^T$    (2.7) 


We have $x(k+1) = F_k(x(k), u(k))$, defined by concatenation, and $y(k) = G_k(x(k), u(k))$ is a 
restatement of (2.2). For this deterministic model in which the input fully determines the output, 
(2.7) is called the (full) history state. A practical state space model can be regarded as a compression 
of the history state.⁶ 

In many cases we can construct a good policy via state feedback, $u(k) = \phi(x(k))$, for some 
$\phi: \mathbb{R}^n \to \mathbb{R}$; in stochastic control, it is typical to say that $\phi$ is a Markov policy in this case. 
However, the power of this approach is fully realized only if we are flexible in our definition of 
the state. We won’t be using the full history state because of complexity; what’s more, the “full 
history” may not be nearly rich enough. 


2.3.2 State augmentation and learning 


Tracking and disturbance rejection are two of the basic goals in control design. Here we provide a 
brief glimpse of two tricks used to simultaneously track the reference r while rejecting disturbances: 


(i) The definition of state is not sacred: invent a state process that simplifies control design 


(ii) Unknown quantities, including disturbances and even the state space model, can be learned 
based on input-output measurements. 


Let’s maintain our simplifying assumption that the input and output are scalar valued, and 
take $X = \mathbb{R}^n$. The state evolution is also influenced by a scalar disturbance d that is outside of our 
control, which requires a modification of (2.6a): 


$x(k+1) = F(x(k), u(k), d(k))$    (2.8) 

The ultimate goal is to achieve these three objectives simultaneously: 

(a) Tracking: with $\tilde{y}(k) = y(k) - r(k)$, 


$\limsup_{k \to \infty} |\tilde{y}(k)| = e_\infty$, with $e_\infty = 0$, or very small    (2.9) 


(b) Disturbance rejection: The error $e_\infty$ is not highly sensitive to the disturbance d. 


(c) Tuned transient response (you probably know what kind of acceleration “feels right” when 
driving a car). 


⁶See [337] and its references. 

A common special case is when the reference and disturbance are assumed independent of time 
(e.g., driving at constant speed with a steady headwind). In this special case, suppose in addition 
that the disturbance is known. We might choose $u(k) = \phi(x(k), r(0), d(0))$, where the policy $\phi$ 
is designed for success: $e_\infty = 0$. Typically, $\phi$ is designed so that the state is also convergent: 
$x(k) \to x(\infty)$ as $k \to \infty$. The limiting values must satisfy 


$x(\infty) = F(x(\infty), u(\infty), d(0))$ 
$u(\infty) = \phi(x(\infty), r(0), d(0))$ 


The outcome $e_\infty = 0$ is expressed as the final constraint: 

$r(0) = y(\infty) = G(x(\infty), u(\infty))$ 


This approach is thus dependent on an accurate model, as well as direct measurements of d. 
Suppose that instead of exact measurements of the disturbance, we have a state space model 
whose output is $y_m(k) = (r(k), d(k))^T$: 


$z(k+1) = F_m(z(k))$    (2.10a) 
$y_m(k) = G_m(z(k))$    (2.10b) 


where z evolves on $\mathbb{R}^p$ for some integer $p \ge 1$. The functions $F_m: \mathbb{R}^p \to \mathbb{R}^p$ and $G_m: \mathbb{R}^p \to \mathbb{R}^2$ are 
assumed known. Part of this state description is d(k + 1) = d(k) if the disturbance is static. 
Given the larger state space model (2.8, 2.10), we might opt for an observer based solution: 


$u(k) = \phi(x(k), r(k), \hat{d}(k))$ 

where $\{\hat{d}(k)\}$ are estimates of the disturbance, based on input-output measurements up to time k 
(we might replace x(k) with $\hat{x}(k)$ if we don’t directly observe the state). Observer design makes up 
about 20% of a typical introductory course on state space control systems [7, 77, 29]. 

A second option, called the Internal Model Principle, is to create a different state 
augmentation that is entirely observed. For the sake of illustration, consider again the case of constant 
reference/disturbance. We have (2.10) in this case with $z(k) = y_m(k)$, and $F_m$ is the identity 
function: 

$z(k+1) = z(k)$ 

State augmentation is performed based on this model: define for each k, 

$z^I(k+1) = z^I(k) + \tilde{y}(k)$,    (2.11) 


with error $\tilde{y}(k)$ defined above (2.9). We regard $(x(k), z^I(k))$ as the state for the purposes of 
control, and hence state feedback takes the form 


$u(k) = \phi(x(k), z^I(k))$    (2.12) 


The control design (2.12) is an example of integral control, since $z^I$ is the sum of errors (the 
discrete-time analog of integration). 

Suppose that $z^I(k)$ converges to some finite limit $z^I(\infty)$, as $k \to \infty$; the value of the limit is 
irrelevant. This and eq. (2.11) imply perfect tracking: 


$\lim_{k\to\infty} \tilde{y}(k) = \lim_{k\to\infty} [z^I(k+1) - z^I(k)] = 0.$ 


This conclusion is remarkable: to obtain perfect tracking, we only require that the policy $\phi$ is 
designed so that $z^I(k)$ converges to some finite limit. The secret to success is a hidden element of 
“learning” that comes with integral control. 

State augmentation has many other dimensions. If we have forecasts of significant disturbances, 
then it may be wise to make use of this data: forecasts can be used in the design of the feedforward 
component $u_{ff}(k)$ in the decomposition (2.1), or they may be used for state augmentation. 
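
The integral control idea can be illustrated with a minimal Python sketch. Everything specific here is an assumption chosen for illustration: a scalar plant x(k+1) = a x(k) + u(k) + d with y = x, a linear policy acting on the augmented state, and the hypothetical gains `k1`, `k2`. The disturbance d is constant and is never measured by the controller.

```python
def simulate_integral_control(a, k1, k2, r, d, steps):
    """Closed loop for the illustrative plant x(k+1) = a x(k) + u(k) + d, y = x.

    The policy is u(k) = -k1 x(k) - k2 zI(k), where zI accumulates the
    tracking error y(k) - r. Returns the list of tracking errors.
    """
    x, z_int = 0.0, 0.0            # plant state and integrator state
    errors = []
    for _ in range(steps):
        err = x - r                # tracking error y(k) - r(k)
        errors.append(err)
        u = -k1 * x - k2 * z_int   # feedback on the augmented state (x, zI)
        x = a * x + u + d          # plant update; d is unknown to the policy
        z_int = z_int + err        # integrator update: running sum of errors
    return errors


# Open-loop unstable plant (a > 1) with three different unknown disturbances.
for d in (0.0, 0.3, -1.0):
    errors = simulate_integral_control(a=1.2, k1=0.7, k2=0.1, r=1.0, d=d, steps=300)
    print(d, abs(errors[-1]))
```

With these gains the closed loop in the variables (x, zI) has characteristic polynomial λ² − 1.5λ + 0.6, whose roots have modulus √0.6 < 1, so the integrator state converges. The tracking error then vanishes for each value of d, even though the policy never uses d.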
2.3.3 Linear state space model 
When F and G are linear we obtain the linear state space model: 


$x(k+1) = F x(k) + G u(k), \quad x(0) = x_0$    (2.13a) 

$y(k) = H x(k) + E u(k)$    (2.13b) 

where (F, G, H, E) are matrices of suitable dimension (in particular, F is n × n for an n-dimensional 
state space). 

The state space model is not unique, in the sense that there are many choices for (F, G, H, E) 
that result in the same input-output behavior, even though the definition of the state process x 
will change depending on the model. And never forget: we may add additional components to x(k) 
as a means to solve a control problem. 


Linear state feedback The linear model (2.13) is often constructed so that the goal is to keep 
x(k) near the origin—the regulation problem; consider for example flight control, where we wish 
to maintain velocity and altitude at some constant values. We first normalize the problem so that 
these constant values are zero. It is then common to apply a linear control law: 


$u(k) = -K x(k)$    (2.14) 


where K is called the gain matrix. To evaluate choice of gain, we tack on something like a reference 
signal: 

$u(k) = -K x(k) + v(k)$ 


The closed loop behavior with new “input v” has a similar state space description: 


$x(k+1) = (F - GK) x(k) + G v(k), \quad x(0) = x_0$    (2.15a) 
$y(k) = (H - EK) x(k) + E v(k)$    (2.15b) 


The signal v(k) appearing in (2.15a) is viewed as an “input disturbance”. A goal of control is to 
choose K so that the closed loop behavior is not very sensitive to this disturbance, while 
simultaneously ensuring good tracking. 
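
As an illustration of the closed loop matrix F − GK, here is a short Python sketch for a two-dimensional example. The particular matrices (a discrete-time double integrator), the gain K, and the helper names are assumptions for illustration only; the gain was picked by hand so that the characteristic polynomial of F − GK is λ² − 0.8λ + 0.16, with both roots at 0.4.

```python
def closed_loop_matrix(F, G, K):
    """Return F - G K for a single-input model: G is a column, K is a row."""
    n = len(F)
    return [[F[i][j] - G[i] * K[j] for j in range(n)] for i in range(n)]


def simulate(F_cl, x0, steps):
    """Iterate x(k+1) = F_cl x(k), i.e. the closed loop with v(k) = 0."""
    x = list(x0)
    for _ in range(steps):
        x = [sum(F_cl[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]
    return x


# Illustrative open-loop model: both eigenvalues of F are equal to 1.
F = [[1.0, 1.0], [0.0, 1.0]]
G = [0.0, 1.0]
K = [0.36, 1.2]                     # hypothetical gain

F_cl = closed_loop_matrix(F, G, K)  # equals [[1, 1], [-0.36, -0.2]] up to rounding
x_final = simulate(F_cl, [1.0, 0.0], 60)
print(F_cl, x_final)
```

In this example the state is driven to the origin (the regulation problem), whereas the open loop model with u = 0 does not converge. How to choose K systematically is a separate question that this sketch does not address.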


Realization theory The ARMA model (2.4) admits an infinite number of distinct state space 
descriptions. Let’s begin with the scalar auto-regressive model: 


$y(k) = -\sum_{i=1}^{N} a_i y(k-i) + u(k), \quad k \ge 0$ 


which is (2.4), with M = 0 and $b_0 = 1$. We obtain the state space model (2.13) with n = N by 
choosing $x(k) = (y(k), \dots, y(k-N+1))^T$, and 


$F = \begin{bmatrix} -a_1 & -a_2 & -a_3 & \cdots & -a_N \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{bmatrix}, \qquad G = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$    (2.16) 

$H = [1, 0, 0, \dots, 0]$, and $E = 0$. 
This construction can be generalized: with M = N − 1 in (2.4), we first define an intermediate 
process 


$z(k) = -\sum_{i=1}^{N} a_i z(k-i) + u(k), \quad k \ge 0$    (2.17) 


so that we arrive at a state space model with state $x(k) = (z(k), \dots, z(k-N+1))^T$ to 
describe the evolution of z. We next use the assumption that M = N − 1: setting u(k) = z(k) = 0 
for k < 0, it is possible to show that 


$y(k) = \sum_{i=0}^{N-1} b_i z(k-i) = H x(k) + E u(k)$ 

where 

$H = [b_0, b_1, \dots, b_{N-1}], \quad E = 0$    (2.18) 


The state space description (2.16,2.18) is known as controllable canonical form. There are many 
other “canonical forms”, with special properties you can learn about in a linear systems course 
(29, 205,. 7,0 %;. 15}. 
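
The claim that the intermediate process z reproduces the ARMA output can be checked numerically. The Python sketch below is hedged in two ways: the coefficients are arbitrary illustrative values, and the state update is indexed as an assumption of the sketch, forming x(k) = (z(k), ..., z(k−N+1)) from x(k−1) and u(k), which is the reading implied by the definition of x(k) together with (2.17).

```python
def arma_output(a, b, u):
    """Direct simulation of (2.4) with M = N - 1 and zero initial conditions."""
    y = []
    for k in range(len(u)):
        ar = sum(a[i - 1] * y[k - i] for i in range(1, len(a) + 1) if k - i >= 0)
        ma = sum(b[i] * u[k - i] for i in range(len(b)) if k - i >= 0)
        y.append(-ar + ma)
    return y


def canonical_form_output(a, b, u):
    """Output of the realization built from the intermediate process z."""
    n = len(a)
    x = [0.0] * n                   # holds (z(k-1), ..., z(k-N)) before the update
    y = []
    for k in range(len(u)):
        z_new = -sum(a[i] * x[i] for i in range(n)) + u[k]   # recursion (2.17)
        x = [z_new] + x[:-1]        # shift: x is now (z(k), ..., z(k-N+1))
        y.append(sum(b[i] * x[i] for i in range(n)))         # y(k) = H x(k), E = 0
    return y


a = [-0.5, 0.06]                    # a_1, a_2 (illustrative values)
b = [1.0, 0.4]                      # b_0, b_1, so that M = N - 1 = 1
u = [1.0, -2.0, 0.5, 0.0, 3.0, 1.0, 0.0, 0.0]

y_arma = arma_output(a, b, u)
y_ss = canonical_form_output(a, b, u)
print(max(abs(p - q) for p, q in zip(y_arma, y_ss)))
```

The two outputs agree up to rounding error for this input, consistent with the identity $y(k) = \sum_i b_i z(k-i)$ stated above. A different input sequence or coefficient set can be substituted to probe the same identity.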


2.3.4 A nod to Newton and Leibniz 


In many engineering applications it is best to start off in continuous time, with thanks to Newton 
and Leibniz for bringing us calculus. 

Some notational conventions reserved for continuous time: First, time is denoted using a 
subscript (such as $u_t$ rather than u(t)) as a reminder that time is continuous. Moreover, it is often 
convenient to suppress time dependency altogether, so that $\frac{d}{dt}u$ represents the derivative at an 
unspecified time. 

The state space model in continuous time has the form 


$\frac{d}{dt}x = f(x, u)$    (2.19) 


where x is the state evolving in $\mathbb{R}^n$, and u the input evolving in $\mathbb{R}^m$. The motion of a typical 
solution to a nonlinear state space model in $\mathbb{R}^2$ is illustrated in Fig. 2.2. 


Figure 2.2: Trajectory of a nonlinear state space model in two dimensions: at any time t, the velocity $\frac{d}{dt}x_t$ is a 
function of the current state $x_t$ and input $u_t$. 


When the function f appearing in (2.19) is linear, then we obtain the linear state space model 
in continuous time. As in (2.13), this is accompanied by an observation process y evolving on $\mathbb{R}^p$: 


$\frac{d}{dt}x = Ax + Bu$    (2.20a) 
$y = Cx + Du$,    (2.20b) 

and A, B, C, D are matrices of appropriate dimensions. 


Pre-publication draft -- March 25, 2022 


CHAPTER 2. CONTROL CRASH COURSE 19 

The geometry illustrated in Fig. 2.2 is sometimes valuable in gaining intuition in control design 
(note that the vector $f(x_t, u_t)$ is tangent to the state trajectory). Stability theory and optimal 
control theory are most attractive in the continuous time domain because of this simple geometry, 
and the simplicity that comes with calculus. 

However, in the end we have to sample time to apply our control and learning algorithms. 
In this book we will opt for an Euler approximation. For sampling interval Δ, the discrete time 
approximation of (2.19) is of the form (2.6a), with $F(x, u) = x + \Delta f(x, u)$. For the linear model 
(2.20a) this leads to $F = I + \Delta A$. 
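
The Euler approximation can be sketched in a few lines of Python. The scalar linear model, the step size, and the function name `euler_simulate` are assumptions chosen so that the result can be compared against a known exact solution.

```python
import math


def euler_simulate(f, x0, u, delta, steps):
    """Iterate x(k+1) = x(k) + delta * f(x(k), u(k)), the Euler form of (2.19)."""
    x = x0
    traj = [x]
    for k in range(steps):
        x = x + delta * f(x, u(k))
        traj.append(x)
    return traj


# Illustrative scalar linear model d/dt x = a x + u with zero input.
a = -1.0
f = lambda x, u: a * x + u
traj = euler_simulate(f, x0=1.0, u=lambda k: 0.0, delta=0.01, steps=100)

exact = math.exp(a * 1.0)           # exact solution at time t = 100 * 0.01 = 1
print(traj[-1], exact, abs(traj[-1] - exact))
```

Here the discrete-time model is x(k+1) = (1 + Δa) x(k), the scalar case of F = I + ΔA. The gap between the two printed values shrinks as Δ is reduced, at the cost of more steps per unit of time.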


2.4 Stability and Performance 


In this section we consider the state space model (2.6a) in closed loop: a policy $\phi$ is chosen, so that 
$u(k) = \phi(x(k))$ for each k. Since the feedback law is fixed, the state then evolves as a state space 
model without control. With just a slight abuse of notation we write 


$x(k+1) = F(x(k)), \quad k \ge 0$    (2.21) 


Our interest is in the long-run behavior of the state process; in particular, does it converge to an 
equilibrium? We also seek bounds on a particular performance metric called the total cost. 
The following is assumed throughout: 


The state space X is equal to $\mathbb{R}^n$, or a closed subset    (2.22) 


For example we allow the positive orthant, $X = \mathbb{R}^n_+$. The restriction on the state space (2.22) is 
imposed so that any closed and bounded set $S \subset X$ is necessarily a compact subset of X. 
The definition of an equilibrium $x^e$ is straightforward; it is a state at which the system is frozen: 


$x^e = F(x^e)$    (2.23) 


The equilibrium will in fact be a part of the control design. Think of the cruise control in your car, 
in which “equilibrium” means traveling in a straight line at constant speed. The particular speed 
is something that you as the driver will choose. The control system then does the best it can to 
keep x(k) near the desired value $x^e$. 


2.4.1 Total cost 


This performance metric is based on a function $c: X \to \mathbb{R}_+$, interpreted as the “cost function under 
policy $\phi$”, to be considered in greater depth in Chapter 3. Based on this, we arrive at a strange but 
ubiquitous definition: the total cost is a function of x, known as the (fixed policy) value function, 
and defined as the infinite sum: 


$J(x) = \sum_{k=0}^{\infty} c(x(k)), \quad x(0) = x \in X$    (2.24) 


It is assumed that $c(x^e) = 0$, and we seek conditions ensuring that $x(k) \to x^e$ as $k \to \infty$, so there 
is some hope that J is finite valued. For the cruise control problem, the cost function is designed 
so that c(x) is large if the state x corresponds to a speed that is far from desired. 
Why is the controls community so excited about total cost? This metric for 
performance is not very intuitive, but there are compelling reasons for using it as a performance 
metric in control design: 


(i) It is “forward looking”. One might argue that (2.24) is looking too far forward (who 
cares about infinity), but there is implicit “discounting” of the future since for a good 
policy we have $c(x(k)) \to 0$ quickly as $k \to \infty$. 


(ii) Theory for total cost optimal control is often closely related to average cost optimal 
control—to be seen in Part 2 of the book. 


(iii) If J is finite valued, then stability is typically guaranteed. 


Benefit (iii) is the most abstract, but the most valuable aspect of this performance metric. 
Section 2.4.2 is dedicated to stability theory and its relationship to value functions. A part of this 
theory is based on the (fixed policy) dynamic programming equation:⁷ 


$J(x) = c(x) + J(F(x))$    (2.25) 


This can be derived from the definition (2.24), written as 


$J(x) = c(x) + \sum_{k=0}^{\infty} c(x^+(k))$ 


where $x^+$ is the solution to (2.21), starting at $x^+(0) = F(x)$. 
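
A small numerical check may help make the value function and the dynamic programming equation concrete. The Python sketch below assumes an illustrative scalar model F(x) = a x with |a| < 1 and quadratic cost c(x) = x², for which the infinite sum (2.24) is a geometric series with value x²/(1 − a²). The truncation horizon is a further assumption.

```python
def total_cost(F, c, x0, horizon=2000):
    """Approximate J(x0) = sum_k c(x(k)) by truncating the infinite sum (2.24)."""
    x, J = x0, 0.0
    for _ in range(horizon):
        J += c(x)
        x = F(x)
    return J


a = 0.8                             # illustrative closed-loop dynamics, |a| < 1
F = lambda x: a * x
c = lambda x: x * x                 # cost vanishes only at the equilibrium 0

x = 1.5
J_x = total_cost(F, c, x)
closed_form = x * x / (1.0 - a * a)             # geometric series
dp_residual = J_x - (c(x) + total_cost(F, c, F(x)))

print(J_x, closed_form, dp_residual)
```

The truncated sum matches the closed form, and the residual of J(x) = c(x) + J(F(x)) is zero up to rounding. Truncation is harmless here only because c(x(k)) decays geometrically; for a slowly converging trajectory a much longer horizon would be needed.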


2.4.2 Stability of equilibria 


We survey here the most common definitions of stability for a nonlinear state space model. The 
first and most basic is a form of continuity near the equilibrium $x^e$. Let $\mathcal{X}(k; x_0)$ denote the state 
at time k with initial condition $x_0$: this is simply x(k), obtained recursively from (2.21), starting 
at $x(0) = x_0$. In particular, the equilibrium property (2.23) implies that $\mathcal{X}(k; x^e) = x^e$ for all k. 


Stable in the sense of Lyapunov. The equilibrium $x^e$ is stable in the sense of Lyapunov 
if for all $\varepsilon > 0$, there exists $\delta > 0$ such that if $\|x_0 - x^e\| < \delta$, then 


$\|\mathcal{X}(k; x_0) - \mathcal{X}(k; x^e)\| < \varepsilon$ for all $k \ge 0$. 


In words, if an initial condition is close to the equilibrium, then it will stay close forever. An 
illustration is provided in Fig. 2.3, with $B(r) = \{x \in \mathbb{R}^n : \|x - x^e\| < r\}$ for any r > 0. 

This is a very weak notion of stability, since there is no guarantee that the state will ever 
approach the desired equilibrium. The next definitions impose convergence: 


⁷Dynamic programming equation and Bellman equation are used interchangeably, in reverence to [35]. 

Asymptotic Stability. An equilibrium $x^e$ is said to be asymptotically stable if $x^e$ is stable 
in the sense of Lyapunov and for some $\delta_0 > 0$, whenever $\|x_0 - x^e\| < \delta_0$, 


$\lim_{k \to \infty} \mathcal{X}(k; x_0) = x^e$.    (2.26) 


The set of $x_0$ for which the above limit holds is called the region of attraction for $x^e$. 


The equilibrium is globally asymptotically stable if the region of attraction is all of X: that 
is, $\delta_0 = \infty$, and hence $x(k) \to x^e$ from any initial condition. 


It is common to say that the state space model is globally asymptotically stable. That is, it is 
often stressed that this is a property of the system (2.21) rather than the equilibrium $x^e \in X$. 

Sometimes we obtain a very fast rate of convergence: the state space model is said to be globally 
exponentially asymptotically stable if there are constants $\varrho_0 > 0$ and $B_0 < \infty$ such that for each 
initial condition and $k \ge 0$, 


$\|\mathcal{X}(k; x_0) - x^e\| \le B_0 \|x_0 - x^e\| e^{-\varrho_0 k}$    (2.27) 


2.4.3. Lyapunov functions 


The construction of a Lyapunov function V is the most common approach to establishing asymptotic 
stability, as well as bounds on a value function (and more general bounds on the state process). In 
broad generality, V is a function on X taking non-negative values, and the crucial property that 
makes it a Lyapunov function is that V(x(k)) is decreasing when x(k) is large: this is formalized 
as a drift inequality. The Lyapunov function V is often regarded as a crude notion of “distance” 
to the “center of the state space”. 

For any scalar r, let $S_V(r)$ denote the sublevel set: 


$S_V(r) = \{x \in X : V(x) \le r\}$    (2.28) 


In addition to non-negativity of V, we frequently assume it is 
inf-compact: 


$\{x \in X : V(x) \le V(x^0)\}$ is a bounded set for each $x^0 \in X$ 


That is, the set $S_V(r)$ takes on one of three forms for any r: the 
empty set, $S_V(r) = X$, or $S_V(r) \subset X$ is bounded. 
In most cases we find that $S_V(r) = X$ is impossible, so that 
we arrive at the stronger coercive assumption: 


$\lim_{\|x\| \to \infty} V(x) = \infty$    (2.29) 


Figure 2.3: If $x_0 \in B(\delta)$, then $\mathcal{X}(k; x_0) \in B(\varepsilon)$ for all $k \ge 0$. 


In this case, under our standing assumption (2.22), the set $S_V(r)$ is either empty or bounded for 
each r. Fig. 2.4 illustrates the two classes of functions with X = R (the function shown on the left 
is bounded). Here are three numerical examples: 
(i) $V(x) = x^2$ is coercive since (2.29) holds. 
(ii) $V(x) = x^2/(1+x^2)$ is inf-compact but not coercive: $S_V(r) = \mathbb{R}$ for $r \ge 1$, and $S_V(r) = [-a, a]$ 
(a bounded interval) for $0 \le r < 1$, with $a = \sqrt{r/(1-r)}$. 
(iii) $V(x) = e^x$ is neither: $S_V(r) = (-\infty, \log(r)]$ is not a bounded subset of $\mathbb{R}$ for r > 0. 


The value function J satisfies the intuitive properties of a Lyapunov function under mild 
conditions: 


Lemma 2.1. Suppose that the cost function c and the value function J defined in (2.24) are 
non-negative, and finite valued. Then, 


(i) J(x(k)) is non-increasing, and $\lim_{k \to \infty} J(x(k)) = 0$ for each initial condition. 


(ii) Suppose in addition that J is continuous, inf-compact, and vanishes only at $x^e$. Then, for 
each initial condition, $\lim_{k \to \infty} x(k) = x^e$. 


The proof is postponed to Section 2.4.4, but we note 
here the first steps: the dynamic programming equation 
(2.25) implies that for each $k \ge 0$, 


$J(x(k+1)) = J(x(k)) - c(x(k)) \le J(x(k))$    (2.30) 


That is, J(x(k)) is non-increasing, so that $x(k) \in S_J(r)$ 
for each $k \ge 0$, with r = J(x(0)). The inf-compact 
assumption then implies that the state trajectory is 
“bottled-up” in the bounded set $S_J(r)$. 

Figure 2.4: Inf-compact and coercive (left panel: V is inf-compact; right panel: V is coercive). 

In the context of total-cost optimal control, the basic drift inequality considered in this book is 
Poisson’s inequality: for non-negative functions $V, c: X \to \mathbb{R}_+$, and a constant $\bar\eta \ge 0$, 


$V(F(x)) \le V(x) - c(x) + \bar\eta$    (2.31) 


The reference to a French mathematician is explained in the notes. Poisson’s inequality is a 
relaxation of the dynamic programming equation (2.25) through the introduction of $\bar\eta$, as well as the 
inequality. 

Poisson’s inequality is defined with attention to the dynamics: on combining (2.31) and (2.21) 
we obtain (similar to (2.30)), 


$V(x(k+1)) \le V(x(k)) - c(x(k)) + \bar\eta, \quad k \ge 0$ 


If $\bar\eta = 0$, it follows that the sequence $\{V(x(k)) : k \ge 0\}$ is non-increasing. Under mild assumptions 
on V, we obtain a weak form of stability: 


Proposition 2.2. Suppose that (2.31) holds with $\bar\eta = 0$. Suppose moreover that V is continuous, 
inf-compact, and with a unique minimum at $x^e$. Then the equilibrium is stable in the sense of 
Lyapunov. 


Proof. From the definition of the sublevel sets we obtain 

$\bigcap \{S_V(r) : r > V(x^e)\} = S_V(r)\big|_{r = V(x^e)} = \{x^e\}$ 

The final equality follows from the assumption that $x^e$ is the unique minimizer of V. The 
inf-compact assumption then implies the following inner and outer approximations: for each $\varepsilon > 0$, we 
can find $r > V(x^e)$ and $\delta < \varepsilon$ such that⁸ 


$\{x \in X : \|x - x^e\| < \delta\} \subset S_V(r) \subset \{x \in X : \|x - x^e\| < \varepsilon\}$ 


If $\|x_0 - x^e\| < \delta$, then $x_0 \in S_V(r)$, and hence $x(k) \in S_V(r)$ for all $k \ge 0$ since V(x(k)) is 
non-increasing. The final inclusion above then implies that $\|x(k) - x^e\| < \varepsilon$ for all k. Stability in the 
sense of Lyapunov follows. □ 


Bounds on the value function J are obtained by iteration: for example, the two bounds 

$V(x(2)) \le V(x(1)) - c(x(1)) + \bar\eta, \qquad V(x(1)) \le V(x(0)) - c(x(0)) + \bar\eta$ 

imply that $V(x(2)) \le V(x(0)) - c(x(0)) - c(x(1)) + 2\bar\eta$. We can of course go further: 


Proposition 2.3. (Comparison Theorem) Poisson’s inequality (2.31) implies the following 
bounds: 


(i) For each $N \ge 1$ and x = x(0), 

$V(x(N)) + \sum_{k=0}^{N-1} c(x(k)) \le V(x) + N\bar\eta$    (2.32) 

(ii) If $\bar\eta = 0$, then $J(x) \le V(x)$ for all x. 


(iii) Suppose that $\bar\eta = 0$, and that V, c are continuous. Suppose moreover that c is inf-compact, 
and vanishes only at $x^e$. Then, the equilibrium is globally asymptotically stable. □ 


The proof is found in Section 2.4.4. 
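
The Comparison Theorem can be probed numerically on a nonlinear example. In the Python sketch below the dynamics F(x) = 0.5 sin(x), the cost c(x) = x², and the candidate V(x) = 2x² are all assumptions chosen for illustration. Since |sin(x)| ≤ |x|, we have V(F(x)) ≤ 0.5x² ≤ V(x) − c(x), so Poisson's inequality (2.31) holds with zero constant term.

```python
import math


def F(x):
    return 0.5 * math.sin(x)        # illustrative nonlinear closed-loop dynamics


def c(x):
    return x * x                    # cost, vanishing only at the equilibrium 0


def V(x):
    return 2.0 * x * x              # candidate solution to Poisson's inequality


def comparison_gap(x0, N):
    """Return V(x0) - [V(x(N)) + sum_{k<N} c(x(k))].

    The bound (2.32) with zero constant term says this is non-negative.
    """
    x, total = x0, 0.0
    for _ in range(N):
        total += c(x)
        x = F(x)
    return V(x0) - (V(x) + total)


def total_cost(x0, horizon=200):
    """Truncated version of the infinite sum defining J(x0)."""
    x, J = x0, 0.0
    for _ in range(horizon):
        J += c(x)
        x = F(x)
    return J


for x0 in (-3.0, -0.5, 0.1, 2.0):
    print(x0, comparison_gap(x0, 50), total_cost(x0), V(x0))
```

For each initial condition the gap is positive and the approximate total cost sits below V(x0), as parts (i) and (ii) predict. The bound is not tight for this choice of V, which is what motivates the question of equality taken up next.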

Prop. 2.3 raises a question: what if Poisson’s inequality is tight, so that the inequality in (2.31) 
is replaced by equality? Consider this ideal with $\bar\eta = 0$, and use the more suggestive notation 
$V = J^\circ$ for the Lyapunov function: 


$J^\circ(F(x)) = J^\circ(x) - c(x)$    (2.33) 


If $J^\circ$ is non-negative valued, then we can take $V = J^\circ$ in Prop. 2.3 to obtain the upper bound 
$J(x) \le J^\circ(x)$ for all x. Equality requires further assumptions: 


Proposition 2.4. Suppose that (2.33) holds, along with the following assumptions: 
(i) J is continuous, inf-compact, and vanishes only at $x^e$. (ii) $J^\circ$ is continuous. 


Then, $J(x) = J^\circ(x) - J^\circ(x^e)$ for each x. 


⁸This conclusion requires a bit of topology: the characterization of compact sets in terms of “open coverings”. If 
this is new to you, don’t worry: topology is not a pre-requisite for this book. In the future you might want to take a 
first year mathematical analysis course. 
2.4.4 Technical proofs 
To establish Props. 2.3 and 2.4 we first require Lemma 2.1. 
Proof of Lemma 2.1. We begin with the sample path representation of (2.25): 

$J(x(k+1)) - J(x(k)) + c(x(k)) = 0$    (2.34) 


Summing each side from k = 0 to N − 1 gives for each x = x(0), and each N, 


$J(x) = J(x(N)) + \sum_{k=0}^{N-1} c(x(k))$ 

On taking limits we obtain 

$J(x) = \lim_{N\to\infty} \Big\{ J(x(N)) + \sum_{k=0}^{N-1} c(x(k)) \Big\} = \Big\{ \lim_{N\to\infty} J(x(N)) \Big\} + J(x)$ 


which implies (i) under the assumption that J(x) is finite. 
The inf-compact assumption in (ii) is imposed to ensure that the state trajectory evolves in a 
bounded set: equation (2.30) implies that $x(k) \in S_J(r)$ for the particular value r = J(x(0)), and 
each $k \ge 0$. Suppose that $\{x(k_i) : i \ge 0\}$ is a convergent subsequence of the state trajectory, with 
limit $x^\infty$. Then, $J(x^\infty) = \lim_{i\to\infty} J(x(k_i)) = 0$ follows by continuity of J. 
The assumption that J vanishes only at $x^e$ implies that $x^\infty = x^e$. Part (ii) follows, since every 
convergent subsequence reaches the same value $x^e$. □ 


Proof of Prop. 2.3. The bound (2.32) is established following the discussion preceding the 
proposition. We begin with the sample path representation of (2.31), similar to (2.34): 


$V(x(k+1)) - V(x(k)) + c(x(k)) \le \bar\eta$    (2.35) 


Summing each side from k = 0 to N − 1 gives (i): 


$V(x(N)) - V(x) + \sum_{k=0}^{N-1} c(x(k)) \le N\bar\eta$ 


Part (ii) follows since $V(x(N)) \ge 0$ for each N, so that when $\bar\eta = 0$ we obtain from the preceding 
bound, 

$\sum_{k=0}^{N-1} c(x(k)) \le V(x(0))$ 

The proof of (iii) is identical to Lemma 2.1: part (ii) implies that $\lim_{k\to\infty} c(x(k)) = 0$, and the 
assumptions on c then imply that $x(k) \to x^e$ as $k \to \infty$. 

It remains to show that $x^e$ is stable in the sense of Lyapunov. To see this, first observe that 
with $\bar\eta = 0$, the bound (2.31) implies that $V \ge c$, so that V is also inf-compact. The bound (2.31), 
and conditions on c, $\bar\eta$, also imply that V(x(k)) is strictly decreasing when $x(k) \ne x^e$. Continuity 
of V implies that $V(x(k)) \downarrow V(x^e)$ for each x(0), so that $V(x^e) \le V(x(0))$ for all $x(0) \in X$. Stability 
in the sense of Lyapunov then follows from Prop. 2.2. □ 
Proof of Prop. 2.4. The proof begins with iteration, as in Prop. 2.3: 


$J^\circ(x(N)) + \sum_{k=0}^{N-1} c(x(k)) = J^\circ(x)$ 

Lemma 2.1 (ii) and continuity of $J^\circ$ imply that $J^\circ(x(N)) \to J^\circ(x^e)$ as $N \to \infty$, which implies 
the desired identity: $J^\circ(x^e) + J(x) = J^\circ(x)$. □ 


Figure 2.5: If V is a Lyapunov function, then $V(x_t)$ is non-increasing with time. 


2.4.5 Geometry in continuous time 


Let’s briefly consider an analog of (2.21) in continuous time, with state evolving on $X = \mathbb{R}^n$: 


$\frac{d}{dt}x_t = f(x_t)$    (2.36) 

where $f: \mathbb{R}^n \to \mathbb{R}^n$ is called the vector field. It is common to suppress the time index, writing 

$\frac{d}{dt}x = f(x)$ 


We let $\mathcal{X}(t; x_0)$ denote the solution to (2.36) at time t, when we need to emphasize dependency 
on the initial condition $x_0$. The definition of asymptotic stability of an equilibrium $x^e$ is the same 
as for the state space model in discrete time (2.21). The equilibrium is globally asymptotically 
stable if in addition, 


$\lim_{t\to\infty} \mathcal{X}(t; x_0) = x^e$, for all $x_0 \in X$ 


Verification of global asymptotic stability invites the following assumptions, generalizing the 
theory in discrete time. Recall that $V: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable (or $C^1$) if the 
gradient $\nabla V$ exists and is continuous. 
Lyapunov function for global asymptotic stability.
▶ $V$ is non-negative valued and $C^1$.
▶ It is inf-compact (recall the definition below (2.28)).
▶ For any solution $x$, whenever $x_t \neq x^e$,

$$\frac{d}{dt} V(x_t) < 0. \tag{2.37}$$

Naturally, $\frac{d}{dt} V(x_t) = 0$ if $x_t = x^e$: in this case, $V(x_{t+s}) = V(x^e)$ for all $s \ge 0$.

Fig. 2.5 illustrates the meaning of the vector field $f$ for the special case $X = \mathbb{R}^2$, and the figure is
intended to emphasize the fact that $V(x_t)$ is non-increasing when $V$ is a Lyapunov function. The
drift condition (2.37) can be expressed in functional form,

$$\langle \nabla V(x), f(x) \rangle < 0, \qquad x \neq x^e. \tag{2.38}$$

This is illustrated geometrically in Fig. 2.6.

Figure 2.6: Geometric interpretations of a Lyapunov drift condition: the gradient $\nabla V(x)$ is orthogonal to the level
set $\{y : V(y) = V(x)\}$, which is the boundary of the set $S_V(r)$ with $r = V(x)$.


Proposition 2.5. If there exists a Lyapunov function $V$ satisfying the assumptions for global
asymptotic stability, then the equilibrium $x^e$ is globally asymptotically stable. □


Prop. 2.5 is a partial extension of Prop. 2.3 to the continuous time model. A full extension
requires a version of Poisson’s inequality. Suppose that $c\colon \mathbb{R}^n \to \mathbb{R}_+$ is continuous, $V\colon \mathbb{R}^n \to \mathbb{R}_+$ is
continuously differentiable, and $\bar\eta \ge 0$ is a constant, jointly satisfying

$$\langle \nabla V(x), f(x) \rangle \le -c(x) + \bar\eta, \qquad x \in X \tag{2.39}$$

An application of the chain rule implies that this is a continuous time version of (2.31):

$$\frac{d}{dt} V(x_t) \le -c(x_t) + \bar\eta, \qquad t \ge 0.$$

And with a bit more work, we obtain the following conclusions:

Proposition 2.6. If (2.39) holds for non-negative $c$, $V$, $\bar\eta$, then

$$V(x_T) + \int_0^T c(x_t)\,dt \le V(x) + T\bar\eta, \qquad x_0 = x \in X,\ T \ge 0$$
If $\bar\eta = 0$ then the total cost is finite:

$$\int_0^\infty c(x_t)\,dt \le V(x), \qquad x_0 = x \in X. \tag{2.40}$$

Proof. For any $T \ge 0$, we obtain by the fundamental theorem of calculus,

$$-V(x_0) \le V(x_T) - V(x_0) = \int_0^T \Bigl(\frac{d}{dt} V(x_t)\Bigr)\,dt \le T\bar\eta - \int_0^T c(x_t)\,dt, \qquad T \ge 0.$$

If $\bar\eta = 0$, then the bound (2.40) follows on letting $T \to \infty$. □
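
As a quick sanity check of Prop. 2.6, here is a scalar worked example. The particular choices $f(x) = -x$, $V(x) = \tfrac12 x^2$ and $c(x) = x^2$ are illustrative assumptions and are not taken from the text; they are picked so that every quantity can be computed by hand.

```latex
% Assumed example: f(x) = -x, V(x) = x^2/2, c(x) = x^2, constant term equal to zero
\begin{align*}
\langle \nabla V(x), f(x) \rangle &= x \cdot (-x) = -c(x), \\
x_t &= e^{-t} x, \qquad x_0 = x, \\
\int_0^\infty c(x_t)\,dt &= x^2 \int_0^\infty e^{-2t}\,dt = \tfrac12 x^2 = V(x).
\end{align*}
```

In this example the drift condition (2.39) holds with equality and a zero constant, and the total-cost bound (2.40) is then achieved with equality as well.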


Converse Theorems We have seen this implication: 
Existence of Lyapunov function => Stability and/or performance bound 


where the nature of stability depends on the nature of the Lyapunov function bound. What about 
a converse? That is, if the system is stable, can we infer that a Lyapunov function exists? 
Assume moreover that the total cost is finite:

$$J(x) = \int_0^\infty c(x_t)\,dt, \qquad x_0 = x$$

with arbitrary initial condition. If $J$ is differentiable, then we obtain a solution to (2.37) using
$V = J$:

Proposition 2.7. If $J$ is finite valued, then for each initial condition $x_0$ and each $t$,

$$\frac{d}{dt} J(x_t) = -c(x_t) \tag{2.41}$$

If $J$ is continuously differentiable, the Lyapunov bound (2.37) follows with equality:

$$\nabla J(x) \cdot f(x) = -c(x)$$


Proof. We have a simple version of Bellman’s principle (a focus of Chapter 3): for any $T \ge 0$,

$$J(x) = \int_0^T c(x_r)\,dr + J(x_T)$$

For $t \ge 0$, $\delta > 0$ given, apply this equation with $T = t + \delta$ and $T = t$:

$$J(x) = \int_0^{t+\delta} c(x_r)\,dr + J(x_{t+\delta})$$

$$J(x) = \int_0^{t} c(x_r)\,dr + J(x_t)$$

On subtracting, and then dividing by $\delta$, this gives

$$0 = \frac{1}{\delta} \int_t^{t+\delta} c(x_r)\,dr + \frac{1}{\delta}\bigl[J(x_{t+\delta}) - J(x_t)\bigr]$$

Letting $\delta \downarrow 0$, the first term converges to $c(x_t)$ because $c\colon \mathbb{R}^n \to \mathbb{R}$ is continuous, and the second
term converges to the derivative of $J(x_t)$ with respect to time, which establishes (2.41). The final
conclusion follows from the chain rule. □
2.4.6 Linear state space models
If the dynamics in (2.21) are linear, with $x(k) \in X = \mathbb{R}^n$, then

$$x(k+1) = F x(k) \tag{2.42}$$

for an $n \times n$ matrix $F$, and by iteration

$$x(k) = F^k x, \qquad k \ge 0,\ x(0) = x$$

This equation is valid with $k = 0$ since we take $F^0 = I$, the $n \times n$ identity matrix.
Suppose that the cost is also quadratic, $c(x) = x^\top S x$ for a symmetric and positive definite
matrix $S$. It follows that $c(x(k))$ is a quadratic function of $x(0)$ for each $k$:

$$c(x(k)) = (F^k x)^\top S F^k x$$

Hence the value function $J$ defined in (2.24) is also quadratic:

$$J(x) = x^\top \Bigl[\sum_{k=0}^\infty (F^k)^\top S F^k\Bigr] x, \qquad x(0) = x \in X$$

That is, $J(x) = x^\top M x$, where $M$ is the matrix within the brackets. It satisfies a linear fixed point
equation, known as the (discrete-time) Lyapunov equation:

$$M = S + F^\top M F \tag{2.43}$$


A proof of the following can be obtained based on these calculations: 


Proposition 2.8. The following are equivalent for the linear state space model (2.42): 


(i) The origin is locally asymptotically stable. 


(ii) The origin is globally asymptotically stable. 
(iii) The Lyapunov equation (2.43) admits a solution M > 0 for any S > 0. 
(iv) Each eigenvalue $\lambda$ of $F$ satisfies $|\lambda| < 1$. □


Controllable canonical form. Recall that this state space realization was based on the
ARMA model (2.4), with $N = n$. If you have taken a course in Signals and Systems, you then
know that stability of the ARMA model (in an input-output sense called BIBO stability) is
verified by examining the roots $\{p_i : 1 \le i \le n\}$ of the rational function

$$a(z) = 1 + \sum_{i=1}^n a_i z^{-i} = \prod_{i=1}^n (1 - p_i z^{-1}), \qquad z \in \mathbb{C}$$

The system is BIBO stable if $|p_i| < 1$ for each $i$. The eigenvalues $\{\lambda_i : 1 \le i \le n\}$ of $F$ are
obtained as the solution to a root finding problem $\Delta_F(\lambda) = 0$, where

$$\Delta_F(\lambda) = \det(\lambda I - F) = \prod_{i=1}^n (\lambda - \lambda_i), \qquad \lambda \in \mathbb{C}$$

For the state space model in controllable canonical form, it can be shown that $\Delta_F(z) = a(z) z^n$
for any $z \in \mathbb{C}$, and hence $\{p_i : 1 \le i \le n\} = \{\lambda_i : 1 \le i \le n\}$.
Example 2.4.1. Linear model in continuous time
Consider the linear ODE

$$\frac{d}{dt} x = A x \tag{2.44}$$

whose solution is expressed through the matrix exponential:

$$x_t = e^{At} x_0, \qquad e^{At} = \sum_{m=0}^\infty \frac{1}{m!} t^m A^m \tag{2.45}$$

Consequently, $x_t \to 0$ as $t \to \infty$ from each initial condition if and only if $A$ is Hurwitz: each
eigenvalue of $A$ has strictly negative real part.

The solution to (2.41) is obtained with a quadratic $J(x) = x^\top Z x$, where the matrix $Z$ can be
found through a bit of linear algebra and calculus. The value function is non-negative, so we may
assume $Z$ is positive semidefinite (hence in particular, symmetric: $Z = Z^\top$). Symmetry implies,

$$\frac{d}{dt} J(x_t) = 2 x_t^\top Z A x_t = x_t^\top [Z A + A^\top Z] x_t$$

and from (2.41) this gives

$$x_t^\top [Z A + A^\top Z] x_t = -c(x_t) = -x_t^\top S x_t$$

This must hold for each $t$ and each $x(0)$, giving the Lyapunov equation in continuous time:

$$0 = Z A + A^\top Z + S \tag{2.46}$$


Euler approximation If we sample, with constant sampling interval $\Delta > 0$, then from the
continuous time model (2.44) we obtain the linear model (2.42): with $t_k = k\Delta$,

$$x(t_{k+1}) = e^{A\Delta} x(t_k), \qquad k \ge 0 \tag{2.47}$$

The Euler approximation of (2.44) also results in the linear model (2.42), but with $F = I + \Delta A$.
The matrix $F$ is precisely the first order Taylor series approximation of the matrix exponential.
While only an approximation, it is often good enough for control design.

A particular two dimensional example is $A = \begin{bmatrix} -0.2 & 1 \\ -1 & -0.2 \end{bmatrix}$. The matrix is Hurwitz, with two
eigenvalues $\lambda(A) = -0.2 \pm j$. With sampling interval $\Delta = 0.02$, we find that $F = I + \Delta A$ also has
two complex eigenvalues:

$$\lambda(F) = 1 + \Delta\lambda(A) \approx 0.996 \pm 0.02j$$


The eigenvalues satisfy $|\lambda(F)| < 1$, so we see that stability of the discrete-time approximation is
inherited from the continuous-time model.

The Matlab command M = dlyap(F',eye(2)) returns a solution to the Lyapunov equation
(2.43) with $S = I$ (the identity matrix):

$$M = \begin{bmatrix} 131.9 & 0.0 \\ 0.0 & 131.9 \end{bmatrix}$$
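
The fixed point equation (2.43) can also be checked without Matlab. The following is a minimal sketch in plain Python that iterates $M \leftarrow S + F^\top M F$, which amounts to summing the series that defines $M$. The helper names (`matmul`, `transpose`, `dlyap_fixed_point`) are hypothetical, and the matrix `A` is an assumption: it is one matrix with eigenvalues $-0.2 \pm j$, consistent with the values quoted in the example.

```python
# Solve M = S + F^T M F by fixed-point iteration (plain Python, small matrices).

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def dlyap_fixed_point(F, S, tol=1e-12, max_iter=100_000):
    """Iterate M <- S + F^T M F; converges when every eigenvalue of F has modulus < 1."""
    n = len(S)
    Ft = transpose(F)
    M = [row[:] for row in S]
    for _ in range(max_iter):
        FtMF = matmul(matmul(Ft, M), F)
        M_next = [[S[i][j] + FtMF[i][j] for j in range(n)] for i in range(n)]
        change = max(abs(M_next[i][j] - M[i][j]) for i in range(n) for j in range(n))
        M = M_next
        if change < tol:
            break
    return M

delta = 0.02
A = [[-0.2, 1.0], [-1.0, -0.2]]   # assumed: eigenvalues -0.2 +/- j
F = [[(1.0 if i == j else 0.0) + delta * A[i][j] for j in range(2)] for i in range(2)]
S = [[1.0, 0.0], [0.0, 1.0]]      # S = I, as in the text
M = dlyap_fixed_point(F, S)
```

For this choice of `A` one has $F^\top F = 0.992416\, I$, so the series is geometric and $M = I/(1 - 0.992416)$, which agrees with the diagonal entries reported above after rounding.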


The fact that $F$ has complex eigenvalues implies that the state process will exhibit rotational
motion. The sample path of $x$ shown on the left hand side of Fig. 2.7 spirals towards the origin,
and is intuitively “stable”. The plot on the right is a simulation of the linear model subject to a
“white noise” disturbance:

$$X(k+1) = F X(k) + N(k+1), \qquad k \ge 0 \tag{2.48}$$

See discussion below eq. (7.46) for details of the disturbance process $N$.


Figure 2.7: At left is a sample path of the deterministic linear model (2.42). At right is a sample path from the 
linear model with disturbance, (2.48). 


Example 2.4.2. Frictionless pendulum 


The frictionless pendulum illustrated on the left hand side of Fig. 2.8 is a favorite example in physics 
and undergraduate control courses. It is based on several simplifying assumptions: 


- There is no friction or air resistance 
- The rod on which the bob swings is rigid, and without mass 
- The bob has mass, but zero volume 
- Motion occurs only in two dimensions 
- The gravitational field is uniform 
- “F = MA” (apply classical mechanics, subject to the foregoing) 
A nonlinear state space model is obtained in which $x_1$ is the angular position $\theta$, and $x_2$ its derivative:

$$\frac{d}{dt} x = f(x) = \begin{bmatrix} x_2 \\ -\frac{g}{\ell} \sin(x_1) \end{bmatrix} \tag{2.49}$$

Shown on the right hand side of Fig. 2.8 are sample trajectories of $x_t$, and two equilibria.


Figure 2.8: Frictionless pendulum: stable and unstable equilibria for the state space model.


An inspection of state trajectories shown on the right hand side of Fig. 2.8 reveals that the
equilibrium $x^e = (\pi, 0)^\top$ is not stable in any sense, which agrees with physical intuition (the pendulum
is sitting upright in this case). Trajectories which begin near the equilibrium $x^e = 0$ will remain
near this equilibrium thereafter.
The origin is stable in the sense of Lyapunov. To see this, consider a Lyapunov function defined
as the sum of potential and kinetic energy:

$$V(x) = \text{PE} + \text{KE} = mg\ell[1 - \cos(x_1)] + \tfrac12 m\ell^2 x_2^2$$

The first term is potential energy relative to the height at the equilibrium $x^e = 0$, and the second
is the classical “$\text{KE} = \tfrac12 mv^2$” formula for kinetic energy. It is not surprising that $V$ is minimized
at $x^e = 0$.

We have $\nabla V(x) = m\ell^2[(g/\ell)\sin(x_1),\ x_2]^\top$, and

$$\nabla V(x) \cdot f(x) = m\ell^2 \bigl\{ (g/\ell)\sin(x_1) \cdot x_2 - x_2 \cdot (g/\ell)\sin(x_1) \bigr\} = 0$$

This means that $\frac{d}{dt} V(x_t) = 0$, and hence $V(x_t)$ does not depend on time. For example, the periodic
orbit shown in Fig. 2.8 evolves in a level set of $V$:

$$\frac{g}{\ell}[1 - \cos(x_1(t))] + \tfrac12 x_2(t)^2 = \text{const.}$$

From this it follows that the origin is stable in the sense of Lyapunov.

Linearization: Using the first-order Taylor series approximation $\sin(\theta) \approx \theta$, the state space
equation for the pendulum can be approximated by the LTI model (2.44): $\frac{d}{dt} x = Ax$, with

$$A = \begin{bmatrix} 0 & 1 \\ -g/\ell & 0 \end{bmatrix} \tag{2.50}$$

The eigenvalues of $A$ are obtained on solving the quadratic equation $0 = \det(I\lambda - A)$:

$$0 = \det \begin{bmatrix} \lambda & -1 \\ g/\ell & \lambda \end{bmatrix} = \lambda^2 + g/\ell \implies \lambda = \pm j\sqrt{g/\ell}$$

The complex eigenvalues are consistent with the periodic behavior of the pendulum.
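
The conservation of energy derived in this example can be observed numerically. The sketch below integrates the pendulum model (2.49) with a classical fourth-order Runge-Kutta scheme and records the normalized energy $V(x)/(m\ell^2)$ along the trajectory. The value `G_OVER_L = 1.0`, the step size, the initial condition and the choice of integrator are all assumptions made for illustration.

```python
import math

G_OVER_L = 1.0  # assumed normalized value of g/l (hypothetical)

def f(x):
    """Vector field (2.49) for the frictionless pendulum."""
    x1, x2 = x
    return (x2, -G_OVER_L * math.sin(x1))

def energy(x):
    """V(x) divided by m*l^2: (g/l)[1 - cos(x1)] + x2^2 / 2."""
    x1, x2 = x
    return G_OVER_L * (1.0 - math.cos(x1)) + 0.5 * x2 ** 2

def rk4_step(x, dt):
    """One classical Runge-Kutta step for dx/dt = f(x)."""
    k1 = f(x)
    k2 = f(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k1)))
    k3 = f(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k2)))
    k4 = f(tuple(xi + dt * ki for xi, ki in zip(x, k3)))
    return tuple(xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for xi, a, b, c, d in zip(x, k1, k2, k3, k4))

x = (1.0, 0.0)          # start at rest, one radian from the stable equilibrium
dt, steps = 0.01, 2000  # simulate 20 time units
energies = [energy(x)]
max_angle = abs(x[0])
for _ in range(steps):
    x = rk4_step(x, dt)
    energies.append(energy(x))
    max_angle = max(max_angle, abs(x[0]))

drift = max(energies) - min(energies)  # should be tiny: V is constant along solutions
```

Up to integration error, the energy stays constant and the angle never exceeds its initial magnitude, which is the level-set picture behind stability in the sense of Lyapunov.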


2.5 A Glance Ahead: From Control Theory to RL 


Here is a definition from Wikipedia, as seen in July 2020: Reinforcement learning (RL) is an area
of machine learning concerned with how software agents ought to take actions in an environment in
order to maximize the notion of cumulative reward. Here is a translation of some of the key terms:

▲ Machine learning (ML) refers to prediction/inference based on sampled data.

▲ Take actions = feedback. That is, the choice of $u(k)$ for each $k$ based on observations.*

▲ Software agent = policy $\phi$. This is where the machine learning comes in: the creation of
$\phi$ is based on a large amount of training data collected in “the environment”.

▲ Cumulative reward = negative of the sum of cost, such as (2.24), but with the inclusion
of the input:

$$\text{Cumulative reward} = -\sum_k c(x(k), u(k))$$

An emphasis in the academic community is truly model-free RL, and most of the theory builds on
the optimal control concepts reviewed in the next chapter. Some of the main ideas can be exposed
right here.

What follows is background on how RL algorithms are currently formulated. Think hard about
alternatives: remember, the field remains young!

* The term features is a common substitute for the observation process $y$ shown in Fig. 2.1.
2.5.1 Actors and critics

The actor-critic algorithm of reinforcement learning is specifically designed within the context of
stochastic control, so this is a topic for Part 2. The origins of the terms are worth explaining here.
We are given a parameterized family of policies $\{\phi^\theta : \theta \in \mathbb{R}^d\}$, which play the role of actors. For
each $\theta$ we (or our “software agents”) can observe features of the state process $x$ under the chosen
policy. The ideal critic then computes exactly the associated value function $J_\theta$, but in realistic
situations we have only an estimate.

Since in this book we are minimizing cost rather than maximizing reward, the output of an
actor-critic algorithm is the minimizer

$$\theta^* = \arg\min_\theta \langle \nu, J_\theta \rangle \tag{2.51}$$

where $\nu \ge 0$ serves as a state weighting. This will be defined as a sum

$$\langle \nu, J_\theta \rangle = \sum_i J_\theta(x^i)\,\nu(x^i)$$

where $\nu(x^i)$ is relatively large for “important states”.

Methods to solve the optimization problem (2.51) are explored in Section 4.6, using an approach
known as gradient free optimization. These algorithms are intended to approximate the true
gradient descent algorithms of optimization surveyed in Section 4.4, and are often called “actor only
methods”. The meaning of actor-critic methods is explained in Chapter 10.

This is an example of ML: optimizing a complex objective function over a large function class 
for the purposes of prediction or classification (in this case we are predicting the best policy). A 
very short introduction to ML can be found in Section 5.1. 


2.5.2 Temporal differences 


Where do we find a critic? That is, how can we estimate a value function without a model? One
answer lies in the sample path representation of the fixed policy dynamic programming equation,
previously announced in (2.30). For any $\theta$ we have

$$J_\theta(x(k)) = c(x(k), u(k)) + J_\theta(x(k+1)), \qquad k \ge 0,\ u(k) = \phi^\theta(x(k))$$

We might seek an approximation $\hat J$ for which this identity is well approximated. This motivates
the temporal difference (TD) sequence commonly used in RL algorithms:

$$\mathcal{D}_{k+1}(\hat J) = -\hat J(x(k)) + \hat J(x(k+1)) + c(x(k), u(k)), \qquad k \ge 0,\ u(k) = \phi^\theta(x(k)) \tag{2.52}$$

After collecting $N$ observations, we obtain the mean-square loss:

$$\Gamma^\varepsilon(\hat J) = \frac{1}{N} \sum_{k=0}^{N-1} [\mathcal{D}_{k+1}(\hat J)]^2 \tag{2.53}$$

We are then faced with another machine learning problem: minimize this objective function over
all $\hat J$ in a given class (for example, this is where neural networks frequently play a role).

If we can make (2.53) nearly zero, then we have a good estimate of a value function. Beyond its 
application to actor-critic methods, there are TD- and Q-learning techniques, designed to minimize 
(2.53) or a surrogate, that are part of a bigger RL toolbox. 
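
To make the temporal difference and its mean-square loss concrete, here is a minimal sketch for a scalar example that is assumed for illustration and does not come from the text: the closed-loop dynamics are $x(k+1) = a\,x(k)$ with cost $c(x) = x^2$, so that the true value function is $J(x) = x^2/(1 - a^2)$. The candidate approximations are $\hat J(x) = \hat m\, x^2$, and the function name `td_loss` is hypothetical.

```python
def td_loss(m_hat, a=0.9, x0=1.0, n_steps=50):
    """Mean-square TD loss for J_hat(x) = m_hat * x**2 on x(k+1) = a*x(k), cost c(x) = x**2."""
    x, total = x0, 0.0
    for _ in range(n_steps):
        x_next = a * x
        # temporal difference (2.52): -J_hat(x(k)) + J_hat(x(k+1)) + c(x(k))
        d = -m_hat * x ** 2 + m_hat * x_next ** 2 + x ** 2
        total += d ** 2
        x = x_next
    return total / n_steps

m_true = 1.0 / (1.0 - 0.9 ** 2)       # J(x) = x^2 / (1 - a^2)
loss_true = td_loss(m_true)           # essentially zero: the identity holds exactly
loss_wrong = td_loss(0.5 * m_true)    # strictly positive for a poor approximation
```

The loss vanishes (up to rounding) at the true value function and is positive otherwise, which is what makes it usable as a training objective over a parameterized class.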
2.5.3 Bandits and exploration

Suppose that our policy is pretty good. Maybe not optimal in any sense, but $x(k) \to x^e$, $u(k) \to u^e$
rapidly as $k \to \infty$, where the limit satisfies $c(x^e, u^e) = 0$. We typically then have, by continuity,

$$\lim_{k\to\infty} [-\hat J(x(k+1)) + \hat J(x(k)) - c(x(k), u(k))] = -\hat J(x^e) + \hat J(x^e) - c(x^e, u^e) = 0 \tag{2.54}$$

It follows that we aren’t observing very much via the temporal difference (2.52). If $N$ is very large
then $\Gamma^\varepsilon(\hat J) \approx 0$. This essentially destroys any hope for a reliable estimate of the value function.
Expressed another way: a good policy does not lead to sufficient exploration of the state space.
There are many ways to introduce exploration. We can for example adapt our criterion as
follows: denote by $\Gamma^\varepsilon(\hat J; x)$ the mean-square loss obtained with $x(0) = x$. Rather than take a very
long run, perform many shorter runs, from many ($M \gg 1$) initial conditions. The loss function to
be minimized is the average

$$\Gamma(\hat J) = \frac{1}{M} \sum_{i=1}^{M} \Gamma^\varepsilon(\hat J; x^i) \tag{2.55}$$

The best way to choose the samples $\{x^i\}$ is a topic of research.
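
The difference between one long run and many short runs can be seen in a small experiment. The sketch below uses an assumed scalar example (dynamics $x(k+1) = a\,x(k)$, cost $c(x) = x^2$, candidate $\hat J(x) = \hat m\,x^2$ with a deliberately poor $\hat m$); none of these choices or the name `run_loss` come from the text. It compares the loss from a single long run with the average loss over several short runs started from spread-out initial conditions.

```python
def run_loss(m_hat, x0, n_steps, a=0.9):
    """Mean-square TD loss along one run of x(k+1) = a*x(k) with cost c(x) = x**2."""
    x, total = x0, 0.0
    for _ in range(n_steps):
        x_next = a * x
        d = -m_hat * x ** 2 + m_hat * x_next ** 2 + x ** 2   # temporal difference
        total += d ** 2
        x = x_next
    return total / n_steps

m_hat = 2.0   # deliberately poor guess; the true coefficient is 1/(1 - a^2), about 5.26

# One very long run: the state settles at the equilibrium, so most terms are near zero.
long_run = run_loss(m_hat, x0=1.0, n_steps=10_000)

# Average over M = 20 short runs from initial conditions spread over [-2, 2].
initial_conditions = [-2.0 + 4.0 * i / 19 for i in range(20)]
multi_run = sum(run_loss(m_hat, x0, n_steps=10)
                for x0 in initial_conditions) / len(initial_conditions)
```

Although the approximation is equally poor in both cases, the long run reports a loss that is orders of magnitude smaller, because almost all of its samples are collected near the equilibrium.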
Another approach is to let the input do the exploring. The policy is modified slightly through
the introduction of “noise”:

$$u(k) = \tilde\phi(x(k), \xi(k))$$

For example, $\{\xi(k)\}$ might be a scalar signal, defined as a mixture of sinusoids. The noisy policy
is defined so that

(i) $\tilde\phi(x(k), \xi(k)) \approx \phi(x(k))$ for “most $k$”
(ii) The state process “explores”. In particular, the policy is designed to avoid convergence of
$(x(k), u(k))$ to any limiting value.


This is a crude approach, since by changing the input process, the associated value function also 
changes. More sensible approaches are contained in Chapters 4 and 5, and in the second part of 
the book: Q-learning and “off policy SARSA” might be designed around an exploratory policy like 
this one, but these algorithms are carefully designed to avoid bias from exploration. 

The theory of exploration is mature only within a very special setting: multi-armed bandits. 
The term “bandit” refers to slot machines: you put money in the machine, pull an arm, and hope 
that more money pops out. A more rational application is in the advertisement industry, in which 
an “arm” is an advertisement (which costs money), and the advertiser hopes money will pop out 
as the ads encourage sales. There is a great history of heuristics and science to create successful 
algorithms to maximize profit, based only on noisy observations of the performance of candidate ads 
([216] is a great reference on the theory of bandits, and a short survey is contained in Section 7.8). 
It is here that the “exploration/exploitation” tradeoff is most clearly seen: you have to accept some 
loss of revenue through exploration in order to learn the best strategy, and then “exploit” as you 
gain confidence in your estimates. 

The situation is much more complex in control applications: imagine that for each state $x(k)$,
there is a multi-armed bandit. “Pulling arm $a$” at time $k$ means choosing $u(k) = a \in U$. Concepts
from bandit theory have led to heuristics to best balance the exploration/exploitation tradeoffs
arising in RL. This is an exciting direction for future research [307, 171].
2.6 How Can We Ignore Noise?


It is hard to explain this precisely to a student without background in probability theory. If you 
have some exposure to stochastic processes, then you might want to skim Section 7.2: you will learn 
how to construct a deterministic “fluid model” or “mean-field model” based on a more detailed and 
complex stochastic state space model, and find justification for control design based on the simpler 
model. 

The pragmatic answer to this question is that we rarely have a reliable model of disturbances, so 
we leave them un-modeled but not ignored. That is, we attempt to create a control architecture that 
is not very sensitive to disturbances. There is an elegant theory of robust control for this purpose, 
though even here “robustness” is only with respect to disturbances within some uncertainty class. 
The most successful outcomes of this literature lean heavily on frequency domain concepts. For 
example, it is assumed that disturbances (the d shown in Fig. 2.1) are largely limited to lower 
frequencies, and measurement noise (the w shown on the right hand side of this figure) is limited 
to higher frequencies. 

Justification for nonlinear control systems is based on Lyapunov function techniques. We
establish stability of our control solution through a Lyapunov function $V$ as outlined in Section 2.4.3,
and then argue that $V$ will continue to have “negative drift” in the form (2.31) even with error in
the model $F$, or in the presence of the disturbance $d$.

Finally, the naive “disturbance free” model obtained through physics, or through techniques 
surveyed in Section 7.2, often provides a great deal of insight for the structure of control solutions. 
We might use this insight to build architectures for reinforcement learning. 


2.7 Examples 


2.7.1 Wall Street 


Let’s begin with an example that clearly does not belong in this chapter. Search for “flash crash” on 
your internet browser to see images of the enormous volatility of stock prices on many time scales. 
While we have few tools for control design at this stage of the book, there are many interesting 
modeling questions that will help illustrate control and RL philosophy. 


Where is the control problem? Let’s consider the specific problem of stock portfolio
management. The task is to create a computer program that makes decisions second-by-second on which
stocks to buy or sell. The goal is to “maximize profit”, but there is also the notion of risk that is
not easily defined without tools from probability and statistics.

Perhaps more significant is that this control problem is not of the centralized variety. Consider 
how Fig. 2.1 is interpreted for stock trading. The Process is the global economy and everything that 
goes along with it! The two blocks State Feedback and Observer are the results of thousands of 
individual decision makers (the “agents”) who forecast future prices (and other events), and employ
optimization strategies for online decision making. Trajectory Generation will also be local to 
each agent: this might represent decisions regarding purchase orders for new computers, new staff, 
or a new office closer to Wall Street. 

In summary: stock-trading is a game, rather than a classical control problem, but this should 
not stop us. As an individual (or company designing software for others), we can treat the “process” 
along with the actions of all other players as a larger process. Reinforcement learning is an appealing 
approach to control design because the learning (or training) does not require a detailed model
(though significant data is required for training). 

This is a great example of the value of both measurements and actuation in control. The better 
your measurements, the more money you can expect to earn in an optimal control solution—there 
is no better example to illustrate this point. The book Flash Boys contains a popular treatment 
of the role of actuation; in particular, the cost of delay in the feedback loop [223] (see also [30]). It 
is claimed that millions of dollars can be made by reducing response delay by a millisecond! 

State feedback? How do we interpret $u(k) = \phi(x(k))$? The input $u(k)$ is easy to understand,
given the description above of the stock portfolio management problem.

What is the state x(k)? I don’t know, and I would not trust anyone who claims to have an 
answer! It is traditional to view prices as a stochastic process that evolves according to the actions of 
millions of citizens and hundreds of corporations. There is modeling theory based on martingales 
and changes of measure, so theory from mathematical finance may provide intuition on how to 
construct a state process. A quick “gut reaction” might be this: x(k) = x°(k), the vector of all 
stock prices at time k. Without any knowledge of finance, my gut tells me that this would be a 
huge mistake. Here are examples of what many would add after further reflection: 


(i) Past history of prices. It is important to visit recent performance in terms of both trends 
and volatility. 


(ii) Forecasts of prices. You may have insider knowledge. You may realize that tweets from 
certain influential people provide insight on the decisions of others, which will then influence 
stock prices. 


(iii) What is the objective? Once you have a formulation of reward and risk, make sure that 
these essential quantities are functions of your state process. 


You then have a very high dimensional vector $x(k)$, and are left to find the feedback law $\phi$.

There is no perfect state description. Even if a state space model were available, the full state 
would not be directly observed (and we would still want to use “side information”, such as the 
tweets of CEOs and politicians). Appendix C contains a summary of belief states for partially 
observed control problems. This is an elegant way to create a fully observed state for the purposes 
of control, but comes with enormous cost in terms of complexity. 


What follows are toy examples which will be useful for applying the methods to be developed 
over the course of this book. The models are presented in continuous time because of the elegance 
of calculus and classical mechanics. 


2.7.2 Mountain Car


The goal is to drive a car with a very weak engine to the top of a very high mountain, as illustrated 
in Fig. 2.9. 

A two dimensional state space model is obtained using position and velocity $x_t = (z_t, v_t)^\top$, and
the input $u$ is the throttle position (which is negative when the car is in reverse). In the following
the state space is defined to be a rectangular region,

$$X = [z^{\min}, z^{\text{goal}}] \times [-\bar v, \bar v]$$

in which $z^{\min}$ is a lower limit for the position $z_t$, and the target position is $z^{\text{goal}}$. The constraint
$x_t \in X$ means that the velocity $v_t$ is bounded in magnitude by $\bar v > 0$.
Within the RL literature, this example was introduced in the dissertation [264], and has since
become a favorite basic example [338].


Figure 2.9: Mountain Car


What makes this problem interesting is that the engine is so weak that it is impossible to reach
the hill directly from some initial conditions. A successful policy will sometimes put the car in
reverse, and travel at maximal speed away from the goal to reach a higher elevation to the left.
Several cycles back and forth may be required to reach the goal.

A continuous-time model can be constructed based on the two forces on the car, illustrated in
Fig. 2.10. To obtain a simple model, we need to be careful with our notion of distance: $z^{\text{goal}} - z_t$
denotes the path distance along the road to the goal, which is not the same as the distance along
the x-axis in Fig. 2.9. Subject to this convention, Newton’s law gives

$$ma = m\frac{d^2}{dt^2} z = -mg\sin(\theta) + \kappa u$$

Figure 2.10: Two forces on the Mountain Car

With state $x = (z, v)^\top$, we arrive at the two dimensional state space model,

$$\frac{d}{dt} x_1 = x_2, \qquad \frac{d}{dt} x_2 = \frac{\kappa}{m} u - g\sin(\theta(x_1)) \tag{2.56}$$

where $\theta(x_1)$ is the road grade at $z = x_1$.

An examination of the potential energy $U$ tells us from which states we can reach the goal
without control (setting $u = 0$ in (2.56)). The potential energy is proportional to elevation, and
can be computed by integrating the negative of force, $-F(z)$. For the control-free model we have
$-F(z) = mg\sin(\theta(z))$, and hence

$$U(z) = U(0) + mg\int_0^z \sin(\theta(s))\,ds \tag{2.57}$$

Figure 2.11: Potential energy for Mountain Car.

The version of this model adopted in [338, Ch. 10] uses these numerical values:

$$\kappa/m = 1, \qquad g = 2.5, \qquad \theta(z) = \pi/2 + 3z$$

In this case (2.57) gives $U(z) = U(0) + mg\sin(3z)/3$. Fig. 2.11 shows the potential energy as a
function of $z$ on the interval $[z^{\min}, z^{\text{goal}}]$. It has a unique maximum at $z^{\text{goal}}$, which implies that it is
necessary to apply external force to reach the goal for any initial condition satisfying $z(0) < z^{\text{goal}}$
and $v(0) \le 0$.

Is the goal reachable? We again examine potential energy. Consider the force as a function of
$z$ with $u(k) = 1$ for all $k$. We obtain $-F(z) = mg\sin(\theta(z)) - \kappa$, and the resulting potential energy
is the integral, denoted $U^1(z) = U(z) - \kappa z$ and shown in Fig. 2.11. We now have $U^1(z^{\min}) > U^1(z^{\text{goal}})$,
so from $z(0) = z^{\min}$ we will reach the goal with this open-loop control law.

Consider the initial position $z^0 = -0.6$, for which $U^1(z^0)$ is indicated with a dashed line, and
let $z^1$ denote the other value satisfying $z^1 > z^0$ and $U^1(z^1) = U^1(z^0)$. If $u(k) = 1$ for all $k$, then
with initial condition $z(0) = -0.6$ and $v(0) = 0$, the car will initially move to the right, and stall
at time $t_1$ for which $z(t_1) = z^1$. It will then reverse direction until it stalls at location $z^0$, and this
process will repeat.
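
The potential energy claims above can be checked with a few lines of code. The sketch below works per unit mass with $U(0) = 0$, uses the numerical values $\kappa/m = 1$, $g = 2.5$ and the resulting formula $U(z) = g\sin(3z)/3$, and locates the stall point $z^1$ by bisection. The endpoint values $z^{\min} = -1.2$ and $z^{\text{goal}} = 0.5$ are the ones used in the numerical experiments of this section; the function names and the bisection routine are hypothetical helpers.

```python
import math

G, KAPPA_OVER_M = 2.5, 1.0      # numerical values quoted in the text
Z_MIN, Z_GOAL = -1.2, 0.5

def U(z):
    """Potential energy per unit mass with u = 0 and U(0) = 0: g*sin(3z)/3."""
    return G * math.sin(3.0 * z) / 3.0

def U1(z):
    """Potential energy per unit mass with constant input u = 1."""
    return U(z) - KAPPA_OVER_M * z

def stall_point(z0, lo, hi, iters=60):
    """Bisection for z1 in [lo, hi] with U1(z1) = U1(z0), assuming U1 is increasing there."""
    target = U1(z0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if U1(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# U1 is increasing where 2.5*cos(3z) > 1, that is for |z| < acos(0.4)/3
z_turn = math.acos(0.4) / 3.0
z1 = stall_point(-0.6, -z_turn, z_turn)
```

With these values the control-free potential is larger at the goal than at the left boundary, the order is reversed once $u = 1$ is applied, and the car started at rest from $z^0 = -0.6$ stalls near $z^1 \approx -0.15$, well short of the goal.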

A discrete time model is adopted in [338, Ch. 10], based on sampling the ODE with sampling
interval $\Delta = 10^{-3}$: using the notation $x(k) = (z(k), v(k))^\top$,

$$z(k+1) = [\![\, z(k) + \Delta v(k+1) \,]\!]_1 \tag{2.58a}$$

$$v(k+1) = [\![\, v(k) + \Delta[u(k) - 2.5\cos(3z(k))] \,]\!]_2 \tag{2.58b}$$

This can be expressed in the form (2.6a) by substituting the expression for $v(k+1)$ in (2.58b) into
the right hand side of (2.58a).

The model is consistent with (2.56) using $\theta(z) = \pi/2 + 3z$. The brackets denote projection of the
values of $z(k+1)$ to the interval $[z^{\min}, z^{\text{goal}}]$, and $v(k+1)$ to the interval $[-\bar v, \bar v]$. In addition, the
constraint $v(k) \ge 0$ is imposed when $z(k) = z^{\min}$, and $u(k) = 0$ when $z(k) = z^{\text{goal}}$ (the car is parked
once it reaches its target). The following values are chosen in numerical experiments:

$$z^{\min} = -1.2, \qquad z^{\text{goal}} = 0.5, \qquad \text{and } \bar v = 70. \tag{2.58c}$$


Here is an aggressive policy that will get you to the top: whatever direction you are going, 
accelerate in that direction at maximum rate (provided this is feasible): 


_ Jo Ak) a=" 
= a else cane 


If $v(k) = 0$, then $\text{sign}(v(k))$ can be taken to be $1$ or $-1$, subject to the constraint that $v(k+1) \neq 0$.
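
The discrete time model (2.58) is easy to simulate. The following is a minimal sketch, with the caveat that the policy coded here is an assumed reading of the verbal description (push in the current direction of travel, and push forward when stationary) and is not claimed to reproduce the displayed policy exactly. The function names are hypothetical, and the simulation simply stops once the goal is reached.

```python
import math

DELTA = 1e-3                            # sampling interval from the text
Z_MIN, Z_GOAL, V_BAR = -1.2, 0.5, 70.0  # values in (2.58c)

def clip(value, lo, hi):
    return max(lo, min(hi, value))

def step(z, v, u):
    """One step of (2.58): update velocity first, then position, with projections."""
    v_next = clip(v + DELTA * (u - 2.5 * math.cos(3.0 * z)), -V_BAR, V_BAR)
    z_next = clip(z + DELTA * v_next, Z_MIN, Z_GOAL)
    if z_next == Z_MIN:
        v_next = max(v_next, 0.0)       # velocity must be non-negative at the lower limit
    return z_next, v_next

def aggressive_policy(z, v):
    """Assumed policy: accelerate in the direction of travel; +1 when stationary."""
    return 1.0 if v >= 0.0 else -1.0

def simulate(z0, v0=0.0, max_steps=200_000):
    """Run until the goal is reached; returns (number of steps, final position)."""
    z, v = z0, v0
    for k in range(max_steps):
        if z >= Z_GOAL:
            return k, z
        z, v = step(z, v, aggressive_policy(z, v))
    return max_steps, z

steps_taken, z_final = simulate(z0=-0.5)
```

Because the input always has the same sign as the velocity, the mechanical energy of the car never decreases along the trajectory, which is why a few swings back and forth are enough to reach the goal from rest near the bottom of the valley.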


2.7.3 MagBall 


The magnetically suspended metal ball illustrated in Fig. 2.12 will be used to illustrate several
important modeling concepts. In particular, it shows how to transform a set of nonlinear differential
equations into a state space model, and how to approximate this by a linear state space model of the form
(2.20). Further details from a control systems perspective may be found in the lecture notes [29].

The input $u$ is the current applied to an electro-magnet, and the output $y$ is the distance
between the center of the ball and the bottom edge of the magnet. Since positive and negative
inputs are indistinguishable at the output of this system, it follows that this cannot be a linear
system. The upward force due to the current input is approximately proportional to $u^2/y^2$, and
hence from Newton’s law for translational motion we adopt the model

$$ma = m\frac{d^2}{dt^2} y = mg - \kappa\frac{u^2}{y^2}$$
Figure 2.12: Magnetically Suspended Ball


where $g$ is the gravitational constant and $\kappa$ is some constant depending on the physical properties
of the magnet and ball.

Control design goal: maintain the distance to the magnet at some reference value $r$.

We obtain a state space model as a first step to control design. This input-output model can
be converted to state space form to obtain something similar to the controllable canonical form
description of the ARMA model in (2.16,2.18): using $x_1 = y$ and $x_2 = \frac{d}{dt} y$,

$$\frac{d}{dt} x_1 = x_2, \qquad \frac{d}{dt} x_2 = g - \frac{\kappa}{m}\frac{u^2}{x_1^2}$$

where the latter equation follows from the formula $\frac{d}{dt} x_2 = \frac{d^2}{dt^2} y$. This pair of equations defines a
two-dimensional state space model of the form (2.19):

$$\frac{d}{dt} x_1 = x_2 = f_1(x_1, x_2, u) \tag{2.60a}$$

$$\frac{d}{dt} x_2 = g - \frac{\kappa}{m}\frac{u^2}{x_1^2} = f_2(x_1, x_2, u) \tag{2.60b}$$

It is nonlinear, since $f_2$ is a nonlinear function of $x$, and also the state space is constrained: $X =
\{x \in \mathbb{R}^2 : x_1 > 0\}$.

Suppose that a fixed current $u^e > 0$ is applied, and that the state $x^e$ is an equilibrium:
$f(x^e, u^e) = 0$. From the definition of $f_1$ in (2.60a) we must have $x_2^e = 0$, and setting $f_2(x^e, u^e)$
equal to zero in (2.60b) gives

$$x_1^e = \sqrt{\frac{\kappa}{mg}}\, u^e > 0 \tag{2.61}$$

If we are very successful with our control design, and $x_t = (r, 0)^\top$ for all $t$, then we must have
$u_t = u^e$ for $t \ge 0$, where $u^e = r\sqrt{mg/\kappa}$: the solution to (2.61) with $x_1^e = r$.

Of course, we don’t expect that this “open loop” approach will be successful. If we are realistically
successful, so that $y_t \approx r$ for all $t$ (perhaps after a transient), then we should expect that $u_t \approx u^e$
as well. The design of a feedback law to achieve this goal is often obtained through an approximate
linear model, called a linearization.
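Linearization about an equilibrium state  The linearization is defined exactly as in the frictionless
pendulum (2.49). Assume that the signals $x_1$, $x_2$ and $u$ remain close to the fixed point
$(x_1^\circ, x_2^\circ, u^\circ)$, and write

\[ x_1 = x_1^\circ + \tilde{x}_1, \qquad x_2 = x_2^\circ + \tilde{x}_2, \qquad u = u^\circ + \tilde{u} \]

where $\tilde{x}_1$, $\tilde{x}_2$, and $\tilde{u}$ are small-amplitude signals. From the state equations (2.60) we then have

\[
\begin{aligned}
\tfrac{d}{dt}\tilde{x}_1 &= x_2^\circ + \tilde{x}_2 = \tilde{x}_2 \\
\tfrac{d}{dt}\tilde{x}_2 &= f_2(x_1^\circ + \tilde{x}_1,\ x_2^\circ + \tilde{x}_2,\ u^\circ + \tilde{u})
\end{aligned}
\]

Applying a first-order Taylor series expansion to the right hand side of the second equation above
gives

\[
\tfrac{d}{dt}\tilde{x}_2 = f_2(x_1^\circ, x_2^\circ, u^\circ)
 + \frac{\partial f_2}{\partial x_1}\Big|_{(x_1^\circ, x_2^\circ, u^\circ)} \tilde{x}_1
 + \frac{\partial f_2}{\partial x_2}\Big|_{(x_1^\circ, x_2^\circ, u^\circ)} \tilde{x}_2
 + \frac{\partial f_2}{\partial u}\Big|_{(x_1^\circ, x_2^\circ, u^\circ)} \tilde{u} + d
\]

The final term $d$ represents the error in the Taylor series approximation. After computing partial
derivatives we obtain

\[
\begin{aligned}
\tfrac{d}{dt}\tilde{x}_1 &= \tilde{x}_2 \\
\tfrac{d}{dt}\tilde{x}_2 &= \alpha\tilde{x}_1 + \beta\tilde{u} + d
\end{aligned}
\qquad \text{with} \quad
\alpha = 2\frac{\kappa}{m}\frac{(u^\circ)^2}{(x_1^\circ)^3}, \quad
\beta = -2\frac{\kappa}{m}\frac{u^\circ}{(x_1^\circ)^2}
\]

This can be represented as a linear state space model with disturbance:

\[
\tfrac{d}{dt}\tilde{x} = \begin{bmatrix} 0 & 1 \\ \alpha & 0 \end{bmatrix}\tilde{x}
 + \begin{bmatrix} 0 \\ \beta \end{bmatrix}\tilde{u}
 + \begin{bmatrix} 0 \\ 1 \end{bmatrix} d, \qquad \tilde{y} = \tilde{x}_1 \tag{2.62}
\]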


There is a hidden approximation in (2.62), since $d$ is in fact a nonlinear function of $(x, u)$. In
control design this approximation is taken one step further by setting d = 0, to obtain the linear 
model (2.20). The approximate model is not very useful for simulations, but often leads to effective 
control solutions. 
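The formulas for the equilibrium current and the coefficients of the linearization can be checked numerically. The following is a minimal sketch, assuming hypothetical values for $m$, $\kappa$ and $r$ (none of these numbers come from the text): it compares the coefficients in (2.62) with finite-difference approximations of the partial derivatives of $f_2$.

```python
import math

# Hypothetical parameter values, chosen only for illustration.
m, kappa, g = 0.1, 0.01, 9.81   # mass, magnet constant, gravity
r = 0.05                        # desired ball position x1

def f2(x1, u):
    """Right hand side of (2.60b); it does not depend on x2."""
    return g - (kappa / m) * u**2 / x1**2

# Equilibrium current from (2.61) with x1 = r.
u_eq = r * math.sqrt(m * g / kappa)

# Coefficients of the linearization, as defined just before (2.62).
alpha = 2 * (kappa / m) * u_eq**2 / r**3
beta = -2 * (kappa / m) * u_eq / r**2

# Central finite differences approximate the same partial derivatives.
h = 1e-6
alpha_fd = (f2(r + h, u_eq) - f2(r - h, u_eq)) / (2 * h)
beta_fd = (f2(r, u_eq + h) - f2(r, u_eq - h)) / (2 * h)

# The eigenvalues of [[0, 1], [alpha, 0]] are +/- sqrt(alpha): one is positive.
unstable_pole = math.sqrt(alpha)
print(f2(r, u_eq), alpha, alpha_fd, beta, beta_fd, unstable_pole)
```

Since $\alpha > 0$, the matrix in (2.62) always has one positive eigenvalue, which is consistent with the remark that the open loop approach should not be expected to succeed.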


2.7.4 CartPole 


The next example has a long history within the control systems literature [258, 331, 14], and was 
introduced to the RL literature in early research of Barto and Sutton [26]. It is today a popular 
test example on openai.com. A history from the perspective of control education can be found in 
[385], which provides the dynamic equations with state $x = (z, \dot{z}, \theta, \dot{\theta})$, where $z$ is the horizontal
position of the cart, and the angle $\theta$ is as shown in Fig. 2.13.

The control design goal is regulation: keep $\theta = 0$ while the cart is moving at some desired speed,
or some desired fixed position. The aforementioned references describe several successful strategies 


Pre-publication draft -- March 25, 2022 




to swing the pole up to a desired position without excessive energy. A normalized model used in 
[385] is given by 


\[
\begin{aligned}
\tfrac{d}{dt}x_1 &= x_2, & \tfrac{d}{dt}x_2 &= u, \\
\tfrac{d}{dt}x_3 &= x_4, & \tfrac{d}{dt}x_4 &= \sin(x_3) - u\cos(x_3)
\end{aligned}
\tag{2.63}
\]


The state equations are easily linearized near the equilibrium $u^\circ = 0$ and $x^\circ = (z^\circ, 0, 0, 0)^T$ for any
$z^\circ$: using the first order Taylor series approximations $\sin(x_3) \approx x_3$ and $\cos(x_3) \approx 1$, we obtain as
in the derivation of (2.62), with $\tilde{x} = x - x^\circ$,

\[
\begin{aligned}
\tfrac{d}{dt}\tilde{x}_1 &= \tilde{x}_2, & \tfrac{d}{dt}\tilde{x}_2 &= u, \\
\tfrac{d}{dt}\tilde{x}_3 &= \tilde{x}_4, & \tfrac{d}{dt}\tilde{x}_4 &= \tilde{x}_3 - u + d
\end{aligned}
\tag{2.64}
\]

Ignoring the “disturbance” (error term) $d$, the ODE (2.64) is a
version of the state space model (2.20) with

\[
A = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 \\ 1 \\ 0 \\ -1 \end{bmatrix}
\]

Figure 2.13: CartPole

The matrix $A$ is not Hurwitz, with eigenvalues at $\pm 1$ and repeated
eigenvalues at 0. This was anticipated at the start: it is unlikely
that the pendulum will remain upright with a constant “open loop”
input, $u_t = 0$. The linear model is of great value for insight, and
designing a linear feedback law to keep the system near the equilibrium:

\[ u = -K\tilde{x} \]


Methods to obtain the $1 \times 4$ matrix $K$ through optimal control techniques will be investigated later
in the book. 
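The claims about the linearized CartPole model are easy to confirm numerically. The sketch below assumes NumPy is available; the Kalman rank test used at the end is a standard tool from linear systems theory that is not derived in the text, and it is included only to indicate that a stabilizing gain $K$ can be expected to exist.

```python
import numpy as np

# Matrices of the linearized CartPole model, ignoring the error term d in (2.64).
A = np.array([[0., 1., 0., 0.],
              [0., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
B = np.array([[0.], [1.], [0.], [-1.]])

# A is Hurwitz only if every eigenvalue has strictly negative real part.
eigs = np.linalg.eigvals(A)
is_hurwitz = bool(np.all(eigs.real < 0))

# Kalman rank test (standard result, not from the text):
# the pair (A, B) is controllable when [B, AB, A^2 B, A^3 B] has rank 4.
ctrb = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(4)])
rank = int(np.linalg.matrix_rank(ctrb))

print(np.sort(eigs.real), is_hurwitz, rank)
```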

In conclusion, we know what to do locally, but the linearization provides no insight whatsoever 
on how to swing the pendulum up to the desired vertical position. The robotics community has 
developed ingenious specialized techniques for classes of nonlinear control problems that include 
CartPole as a special case (see [331, 14, 385, 83], and Exercise 3.10 for a survey of the approach of 
[14]). In the near future we hope to marry existing control approaches with model free techniques 
from RL to obtain reliable control designs in more complex settings. 


2.7.5 Pendubot and Acrobot 


Fig. 2.14 shows an illustration of the Pendubot as it appeared in the robotics laboratory at the 
University of Illinois in the 1990s [330, 29], and a sketch indicating its component parts. It is similar 
to Sutton’s Acrobot [341], which is another example that is currently popular on openai.com. The 
control objective is similar to CartPole: starting from any initial condition, swing the Pendubot up 
to a desired equilibrium, without excessive energy. 

The value of this example is explained in the introduction of [330], where they compare to 
CartPole, and a variation of Furuta [137]: 


“The balancing problem for the Pendubot may be solved by linearizing the equations 
of motion about an operating point and designing a linear state feedback controller, 
very similar to the classical cart-pole problem ... One very interesting distinction of the
Pendubot over both the classical cart-pole system and Furuta’s system is the continuum
of balancing positions. This feature of the Pendubot is pedagogically useful in several
ways, to show students how the Taylor series linearization is operating point dependent
and for teaching controller switching and gain scheduling. Students can also easily
understand physically how the linearized system becomes uncontrollable at $q_1 = 0, \pm\pi$.”
(referring to the first and third illustrations shown in Fig. 2.14 (b), with $q_1$, $q_2$
joint angles shown in Fig. 2.15).


Figure 2.14: (a) The Illinois Pendubot, showing component parts (DC motor, encoder 1, link 1, encoder 2, link 2). (b) A continuum of equilibrium positions; three potential equilibria are shown.


The Pendubot consists of two rigid aluminum links: link 1 is directly coupled to the shaft of a 
DC motor mounted to the end of a table. Link 1 also includes the bearing housing for the second 
joint. Two optical encoders provide position measurements: one is attached at the elbow joint and 
the other is attached to the motor. Note that no motor is directly connected to link 2—this makes 
vertical control of the system, as shown in the illustration, extremely difficult! 

The system dynamics can be derived using the so-called Euler-Lagrange equations found in 
robotics textbooks [332]: 


\[ d_{11}\ddot{q}_1 + d_{12}\ddot{q}_2 + h_1 + \phi_1 = \tau \tag{2.65a} \]

\[ d_{21}\ddot{q}_1 + d_{22}\ddot{q}_2 + h_2 + \phi_2 = 0 \tag{2.65b} \]

where the variables can be deduced from Fig. 2.15. Consequently, this model may be written in
state space form, $\frac{d}{dt}x = f(x, u)$, where $x = (q_1, q_2, \dot{q}_1, \dot{q}_2)^T$, and $f$ is defined from the above equations.

This model admits various equilibria: for example, when $u^\circ = \tau^\circ = 0$, the vertical downward
position $x^\circ = (-\pi/2, 0, 0, 0)$ is an equilibrium, as illustrated on the right hand side of Fig. 2.14.
Three other possibilities are shown in Fig. 2.14 (b), each with $\tau^\circ \neq 0$.

A fifth equilibrium is obtained in the upright vertical position, with $\tau^\circ = 0$ and $x^\circ = (+\pi/2, 0, 0, 0)^T$.
It is clear from the illustration shown on the left hand side of Fig. 2.14 that the upright equilibrium
is strongly unstable in the sense that with $\tau = 0$, it is unlikely that the physical system will remain
at rest. Nevertheless, the velocity vector vanishes, $f(x^\circ, 0) = 0$, so by definition the upright position
is an equilibrium when $\tau = 0$.

Although complex, we may again linearize these equations about the vertical equilibrium. With
the input $u$ equal to the applied torque, and the output $y$ equal to the lower link angle, the resulting
state space model is defined by the following set of matrices in the 1990’s vintage system described
in [330]:

\[
A = \begin{bmatrix} 0 & 1.0000 & 0 & 0 \\ 51.9243 & 0 & -13.9700 & 0 \\ 0 & 0 & 0 & 1.0000 \\ -52.8376 & 0 & 68.4187 & 0 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 \\ 15.9549 \\ 0 \\ -29.3596 \end{bmatrix} \tag{2.66}
\]

\[ C = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}, \qquad D = 0. \]

\[
\begin{aligned}
d_{11} &= m_1\ell_{c1}^2 + m_2\bigl(\ell_1^2 + \ell_{c2}^2 + 2\ell_1\ell_{c2}\cos(q_2)\bigr) + I_1 + I_2 \\
d_{22} &= m_2\ell_{c2}^2 + I_2 \\
d_{12} &= d_{21} = m_2\bigl(\ell_{c2}^2 + \ell_1\ell_{c2}\cos(q_2)\bigr) + I_2 \\
h_1 &= -m_2\ell_1\ell_{c2}\sin(q_2)\,\dot{q}_2^2 - 2m_2\ell_1\ell_{c2}\sin(q_2)\,\dot{q}_2\dot{q}_1 \\
h_2 &= m_2\ell_1\ell_{c2}\sin(q_2)\,\dot{q}_1^2 \\
\phi_1 &= (m_1\ell_{c1} + m_2\ell_1)\,g\cos(q_1) + m_2\ell_{c2}\,g\cos(q_1 + q_2) \\
\phi_2 &= m_2\ell_{c2}\,g\cos(q_1 + q_2)
\end{aligned}
\]

Figure 2.15: Coordinate description of the Pendubot: $\ell_1$ is the length of the first link, and $\ell_{c1}$, $\ell_{c2}$ are the distances
to the center of mass of the respective links. The variables $q_1$, $q_2$ are joint angles of the respective links, and the input
is the torque applied to the lower joint.

Postscripts For those students who have had a course in undergraduate control systems, the
corresponding transfer function has the general form

\[ P(s) = k\,\frac{(s - \gamma)(s + \gamma)}{(s - \alpha)(s + \alpha)(s - \beta)(s + \beta)}, \]

with $k > 0$ and $0 < \alpha < \gamma < \beta$. The variable “$s$” corresponds to differentiation. Writing

\[ P(s) = k\,\frac{s^2 - \gamma^2}{s^4 - (\alpha^2 + \beta^2)s^2 + \alpha^2\beta^2} \]

the transfer function notation $Y(s) = P(s)U(s)$ denotes the ODE model:

\[ \frac{d^4}{dt^4}y - (\alpha^2 + \beta^2)\frac{d^2}{dt^2}y + \alpha^2\beta^2 y = k\Bigl[\frac{d^2}{dt^2}u - \gamma^2 u\Bigr] \]


The roots of the denominator of $P(s)$ are $\{\pm\alpha, \pm\beta\}$, which correspond with the eigenvalues of
$A$. The positive eigenvalues mean that $A$ is not Hurwitz. The fact that $P(s_0) = 0$ for the positive
value $s_0 = \gamma$ implies more bad news (a topic far beyond the scope of this book, but the impact of
zeros in the right-half plane is worth reading about in basic texts, such as [205, 7, 77, 15]). 
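The pole and zero locations described in the postscript can be recovered from the numbers in (2.66). The sketch below assumes NumPy, and uses the structure of $A$ (rows 2 and 4 give the angular accelerations) to reduce the computation to a $2 \times 2$ problem; the variable names `M` and `b` are introduced here for illustration and do not appear in the text.

```python
import numpy as np

# Matrices from (2.66).
A = np.array([[0., 1., 0., 0.],
              [51.9243, 0., -13.9700, 0.],
              [0., 0., 0., 1.],
              [-52.8376, 0., 68.4187, 0.]])
B = np.array([0., 15.9549, 0., -29.3596])

# The angle dynamics have the form q'' = M q + b u, so the eigenvalues of A
# are the square roots (both signs) of the eigenvalues of M.
M = A[np.ix_([1, 3], [0, 2])]
b = B[[1, 3]]
alpha, beta = np.sqrt(np.sort(np.linalg.eigvals(M).real))

# Numerator of the transfer function from u to y = q1 (Cramer's rule):
# (s^2 - M[1,1]) * b[0] + M[0,1] * b[1] = k * (s^2 - gamma^2).
k = b[0]
gamma = np.sqrt(M[1, 1] - M[0, 1] * b[1] / b[0])

print(alpha, gamma, beta, k)
```

With these numbers the ordering $0 < \alpha < \gamma < \beta$ holds, with $\alpha \approx 5.64$, $\gamma \approx 6.54$ and $\beta \approx 9.41$, so the zero at $s_0 = \gamma$ lies between the two unstable poles.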
2.7.6 Cooperative rowing


In a sculling boat, each rower has two oars or ‘sculls’, one on each side of the boat. The control
system discussed here concerns coordination of N individual scullers (meaning just one rower per
boat) that are part of a single team. You can see 5 of N teammates on the left hand side of
Fig. 2.16. The team objective is to maintain constant velocity towards a target (let’s say, the island
of Kaua’i), and also maintain “social distance” between boats.


Figure 2.16: Cooperative rowing with partial information.

A state space model might be formulated as follows. Let $z^i_t$ denote the distance from the origin,
and $u^i_t$ the force exerted by the rower at time $t$. Taking into account the fact that drag increases
with speed, and applying once more Newton’s law $f = ma$, results in the following system equations

\[ \tfrac{d^2}{dt^2}z^i_t = -a_i\,\tfrac{d}{dt}z^i_t + b_i u^i_t - d^i_t \]

in which $\{a_i, b_i\}$ are positive scalars, and the disturbance $\{d^i_t\}$ is left un-modeled. If we ignore the
disturbance (for the purposes of control design), we can pose the rowing game as a linear-quadratic
optimal control problem: a topic covered in Sections 3.1 and 3.6. We will see that this will result
in a policy of the form

\[ u^i_t = K^i x_t + r^i_t \]

where $x$ is the $2N$-dimensional vector of positions and velocities for all the rowers, $K^i$ is a $2N$-
dimensional row vector, and $r^i$ is a scalar function of time that depends upon the tracking goal.
Implementation of this policy requires that each rower know the position and velocity of every other 
rower at each time. Let’s think about how the rowers might cooperate without so much data. 
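Before turning to coordination, it may help to see the single-rower model in isolation. The sketch below is an Euler simulation of one rower with the disturbance set to zero and a constant force; the numerical values of $a$, $b$, the force, and the step size are hypothetical choices made only for illustration.

```python
# Hypothetical parameters for a single rower (index i dropped), with d = 0.
a, b = 0.5, 2.0        # drag and force coefficients (positive scalars)
u = 1.0                # constant force
dt, T = 0.01, 40.0     # Euler step size and time horizon

z, v = 0.0, 0.0        # position and velocity
for _ in range(int(T / dt)):
    # State equations: dz/dt = v, dv/dt = -a v + b u
    z, v = z + dt * v, v + dt * (-a * v + b * u)

print(v, b * u / a)    # the velocity settles near b*u/a
```

With a constant force the velocity converges to $b u / a$, which is the sense in which drag limits the speed that a given effort can sustain.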

Imagine that each rower only views the nearest neighbors to the left and right. This breaks 
the team of size N into (overlapping) sub-teams of size three that coordinate individually. Unfortunately,
if N is large, it is known that this distributed control architecture can lead to large
oscillations in the positions of the boats with respect to the distant island [130]. 

The theory of mean field games suggests that a more robust strategy is obtained with just a bit
of global information: assume that at each time $t$, rower $i$ has access to three scalar observations:
her own position and velocity $z^i_t, v^i_t$, and the average position of all rowers:

\[ \bar{z}_t = \frac{1}{N}\sum_{i=1}^{N} z^i_t \tag{2.67} \]

One possibility is to pretend that $x^i_t = (z^i_t, v^i_t, \bar{z}_t)^T$ evolves according to a state space model of the
form (2.19), in which case it is appropriate to search for a state feedback policy $u^i_t = \phi^i(x^i_t)$.
Before fixing the architecture of the policy it is essential to consider the goals. Since we have
assumed that social distancing is managed through an independent control mechanism, there remain
only two:

\[ z^i_t \approx \bar{z}_t, \qquad v^i_t \approx v^r \qquad \text{for all large } t \]


Based on the discussion in Section 2.3.2 we might obtain better coordination through the introduction
of a fourth variable, defined as the integral of the position error:

\[ z^{Ii}_t = z^{Ii}_0 + \int_0^t [z^i_\tau - \bar{z}_\tau]\, d\tau \]

or the discounted approximation,

\[ z^{Ii}_t = z^{Ii}_0 + \int_0^t e^{-\varrho(t - \tau)} [z^i_\tau - \bar{z}_\tau]\, d\tau \]

with $\varrho > 0$. Once we have made our choice, we then search for a policy defined as a function of the
four variables, $u^i_t = \phi^i(z^i_t, v^i_t, \bar{z}_t, z^{Ii}_t)$.

However, do not forget that this is a game. The “best” choice of $\phi^i$ will depend upon the choice
of $\phi^j$ for all $j \neq i$. We might experiment with “best response” schemes designed to learn a collection
of policies $\{\phi^i : 1 \le i \le N\}$ that work well for all. Best response is also behind the RL training in
AlphaZero [322].


2.8 Exercises 


2.1 Controllable Canonical Form. Consider the state space model (2.18) with $X = \mathbb{R}^n$.


(a) If the input is defined by $u(k) = -Kx(k) + v(k)$ for a $1 \times n$ gain matrix $K$, obtain a state
space model in controllable canonical form (with new input $v$).


(b) For the special case $n = 3$ (so that $F$ is a $3 \times 3$ matrix), design $K$ so that the eigenvalues of
$F - GK$ are each located at 1/2. Perform this calculation by hand. Your answer will depend upon
$\{a_1, a_2, a_3\}$. Based on your effort, explain why this is called controllable canonical form!


(c) Think a bit deeper: have you solved a control problem? With state x(k) defined via (2.17), 
would you say you are making good use of your output measurements? 


If you are baffled, then seek advice from your professor, fellow students, and a good book on state 
feedback methods! 
2.2 Controllability and Observability. Consider the linear state space model 

a(k+1) = Fa(k)+Gu(k), 2«(0)= (1) 


y(k) =Ha(k) with F= iG 4 C= Hl ae ql (2.68) 


If you have taken a state space controls course, then you know that this system is not controllable 
and not observable. If you don’t have this background, then you might be able to guess the 
definitions of these terms after completing this exercise. 


(a) Can you find a feedback law $u(k) = \phi(x(k))$ that results in a bounded output $y$?
(b) Does the situation improve if H = [1 0]? 
(c) How about if G = [0 1]? 
2.3 Stabilizability. The state space model is called stabilizable if there is a feedback law $u(k) =
\phi(x(k))$ that results in a closed loop system that is globally asymptotically stable. The example in
Exercise 2.2 is not stabilizable.

2 1 


Perform the following calculations with F' = | 0 0. a and G = a : 

(a) Design the gain in $u(k) = -Kx(k)$ so that $F - GK$ has repeated eigenvalues (you will see
that you do not have a choice in the value). Is $K$ unique?

(b) Solve the Lyapunov equation eq. (2.43) with $F$ replaced by the closed-loop matrix $F - GK$
from (a), and with $S = I$.

(c) Denote $y(k) = x_1(k) = Hx(k)$. Suppose that our goal is to ensure that $y(k) \to r$ as $k \to \infty$,
with $r$ a constant. Modify your control design as follows:

\[ u(k) = -K_1\tilde{y}(k) - K_2 x_2(k) - K_3 z^I(k) \]

where $\tilde{y}(k) = y(k) - r$ and $z^I(k+1) = z^I(k) + \tilde{y}(k)$ (review discussion surrounding eqn. (2.11)).
Find $\bar{K}_3 > 0$ sufficiently small so that the system remains stable for $0 < K_3 < \bar{K}_3$. This is possible
because of the inherent robustness of feedback (you verified stability when $K_3 = 0$).

(d) Obtain a state space model for the system in closed loop, with augmented state $x^a =
(x_1, x_2, z^I)$:

\[ x^a(k+1) = F^a x^a(k) + G^a r \]

where $F^a$ is $3 \times 3$ and $G^a$ is $3 \times 1$. Plot the eigenvalues of $F^a$ for a range of values of $K_3 > 0$, and
comment on your findings.
Solve the equilibrium equation (for your favorite control design): $x^a(\infty) = F^a x^a(\infty) + G^a r$. Is your
equilibrium $x^a(\infty)$ consistent with your control goals?
Obtain a plot of $y(k)$ as a function of $k$, with initial condition $x_1(0) \gg r$, and verify that it converges
to the desired limit, and at the predicted rate.


2.4 Consider the scalar state space model, $x(k+1) = x(k) - a\,x(k)^3$.


(a) Show that the origin is stable in the sense of Lyapunov, and estimate the region of attraction
(which will depend upon $a$).


(b) Explain why this state space model is not globally asymptotically stable.


The state process $x$ is in fact an Euler approximation of the ODE $\frac{d}{dt}x = -x^3$. See Exercise 2.15
for some interesting features of the solution.


Control systems in continuous time: 
For simulating an ODE you might try ode45 in Matlab. There are several Python alternatives. 
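One Python route is sketched below, under stated assumptions. SciPy's `scipy.integrate.solve_ivp`, whose default `RK45` method is an adaptive Runge-Kutta scheme in the same family as ode45, is the usual choice; to keep the example free of dependencies, the block instead uses a hand-written fixed-step fourth-order Runge-Kutta method. The function name `rk4`, the step count, and the scalar test problem are illustrative choices.

```python
import math

def rk4(f, x0, t_end, n_steps):
    """Classical fixed-step fourth-order Runge-Kutta for a scalar ODE dx/dt = f(x)."""
    h = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        x += (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

# Test problem: dx/dt = -x^3 has the closed-form solution x0 / sqrt(1 + 2 x0^2 t).
x0, t_end = 2.0, 5.0
x_num = rk4(lambda x: -x**3, x0, t_end, n_steps=2000)
x_exact = x0 / math.sqrt(1.0 + 2.0 * x0**2 * t_end)
print(x_num, x_exact)
```

Comparing against a problem with a known solution, as done here, is a cheap way to catch a step size that is too large before moving on to the nonlinear models in the exercises.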


2.5 (Integral control design). The temperature $T$ in an electric furnace is governed by the linear
state equation

\[ \tfrac{d}{dt}T = u + w \]

where $u$ is the control (voltage) and $w$ is a constant disturbance due to heat losses. It is not
directly observed. It is desired to regulate the temperature to a steady-state value prescribed by
the set-point $T = T^\circ$, where $T^\circ$ is your comfort temperature. The following should be solved by
hand:

(a) Design a state-plus-integral feedback controller to guarantee that $T_t \to T^\circ$ as $t \to \infty$, for any
constant $w$. This can take the form $u = -K_1(T - T^\circ) - K_2 z^I$ with

\[ z^I_t = \int_0^t [T_\tau - T^\circ]\, d\tau \]

The closed loop poles should have natural frequency $\omega_n \approx 1$ (that is, the eigenvalues of the $2 \times 2$
matrix that defines the closed loop state space model should satisfy $|\lambda| \approx 1$.)


(b) To what value does the control $u_t$ converge as $t \to \infty$? Has the controller “learned” $w$?


-1 4 
2.6 Solve the following based on the linear state space model fa = Ax with A= | 0 | 
(a) Show that $V(x) = \|x\|^2 = x_1^2 + x_2^2$ is not a Lyapunov function.
(b) Find a quadratic function V that is. 


(c) Consider the Euler approximation $x(k+1) = Fx(k)$ with $F = I + \Delta A$, and $\Delta > 0$. Estimate
the range of $\Delta > 0$ for which your function $V$ from part (b) is a Lyapunov function for this discrete
time system. Is this range complete? That is, does it include all values for which the eigenvalues
of $F$ lie in the open unit disk in $\mathbb{C}$?


2.7 In this exercise you will consider a particular control architecture for cooperative rowing,
using a simplification of the model described in Section 2.7.6. Consider the homogeneous and
disturbance-free system

\[ \tfrac{d^2}{dt^2}z^i = -a\,\tfrac{d}{dt}z^i + u^i, \qquad 1 \le i \le N \]

with $a > 0$. The goal is to maintain $v^i_t = \frac{d}{dt}z^i_t \approx v^r$ for all $t$, and $z^i_t \approx \bar{z}_t$ for each $i, t$, with $\bar{z}_t$ the
average position (recall (2.67)). We wish to achieve these objectives without requiring that each
rower have complete observations.


The following control architecture is of the category studied in [130]:

\[ u^i_t = -K_-[z^i_t - z^{i-1}_t] - K_+[z^i_t - z^{i+1}_t] - K_v[\tfrac{d}{dt}z^i_t - v^r], \qquad 1 \le i \le N \]

where for notational convenience we interpret $z^0 = z^N$ and $z^{N+1} = z^1$. This architecture is well-
motivated in terms of goals, and the desire to make decisions based on only local information.
Unfortunately, theory predicts problems when $N$ is large.


(a) Describe the closed loop dynamics as a $2N$ dimensional state space model, with constant
input $v^r$. This will have the form, for some matrix $K$ and vector $g$,

\[ \tfrac{d}{dt}x = (A - BK)x + g\,v^r \]

The remainder of the exercise is numerical, with $a = v^r = 1$, $K_- = K_+$, and several values of $N$
(say, 10, 500, 5,000):

(b) Choose non-negative gains $K_+$, $K_v$ so that the closed loop system is stable, in the sense that
the key error terms are bounded as functions of time, and convergent:

\[ e^i_z = \lim_{t \to \infty}(z^i_t - \bar{z}_t), \qquad e^i_v = \lim_{t \to \infty}(v^i_t - v^r) \]

See if you can obtain gains so that $|e^i_v| \le 0.05$.
Note that you do not yet have any tools to efficiently compute the control gains. Just experiment
until you find something that works.


(c) Obtain a plot of the eigenvalues of A — BK for the chosen values of N. Do you find complex 
eigenvalues? Eigenvalues at zero? 


(d) Simulate your control design for various non-ideal initial conditions. Think hard about how 
to plot your results to display the poor behavior of these scullers. Discuss your findings. 


2.8 Let’s now consider the rowing game in which each rower has access to the average position
(2.67), and the control architecture

\[ u^i_t = -K_z[z^i_t - \bar{z}_t] - K_I z^{Ii}_t - K_v[\tfrac{d}{dt}z^i_t - v^r], \qquad 1 \le i \le N \tag{2.69} \]

where in the notation of Section 2.7.6,

\[ z^{Ii}_t = z^{Ii}_0 + \int_0^t [z^i_\tau - \bar{z}_\tau]\, d\tau \]

Repeat (a)-(d) of Exercise 2.7 based on this policy.


2.9 You are given a nonlinear input-output system defined by the nonlinear differential equation:

\[ \ddot{y} = y^2(u - y) + 2\dot{u} \tag{2.70} \]

(a) Obtain a two-dimensional nonlinear state-space representation with output $y$, input $u$, and
states $x_1 = y$ and $x_2 = \dot{y} - 2u$.


(b) Linearize this system of equations around its equilibrium output trajectory when $u = 1$, and
write it in state space form.


(c) For those of you with background in classical control: Find the transfer function for the linear
system obtained in (b), and comment on the implications.


(d) Obtain a linear compensator $u = -K\tilde{x}$ for the linearization, where $\tilde{x} = (y - 1, \dot{y})^T$. To be
successful, you want $\tilde{x}_t \to 0$ as $t \to \infty$ for each initial condition for which $\|\tilde{x}_0\|$ is sufficiently small.


2.10 We now consider (2.70) subject to a constant disturbance:

\[ \ddot{y}(t) = y^2(u - y) + 2\dot{u} + d \]

where the value of $d$ is not known in advance. In this case we cannot expect perfect tracking unless
we introduce integral control:

\[ u = -K\tilde{x}^a, \quad \text{where} \quad \tilde{x}^a = (y - 1, \dot{y}, z^I)^T, \quad z^I_t = \int_0^t [y_\tau - 1]\, d\tau \]

Find a $1 \times 3$ row vector $K$ so that this control design is stabilizing in the sense that $\tilde{x}^a$ is bounded,
and $\tilde{x}$ vanishes for “small” initial conditions. Perform simulations to verify that perfect tracking is
achieved for initial conditions near the equilibrium value, and any fixed value of $d$ satisfying $|d| \le 1$.


2.11 Consider the state space model $\frac{d}{dt}x = Ax + Bu$; $y = Cx$, where $A$ is similar to a diagonal
matrix. That is, $\Lambda = V^{-1}AV$ where $\Lambda$ is a diagonal matrix, with each $\Lambda(i,i)$ an eigenvalue of $A$,
and $V$ is a matrix whose columns are eigenvectors.

(a) Obtain a state space model for $\bar{x} = V^{-1}x$, of the form $\frac{d}{dt}\bar{x} = \bar{A}\bar{x} + \bar{B}u$; $y = \bar{C}\bar{x}$, by finding
representations for $(\bar{A}, \bar{B}, \bar{C})$. This state space representation is called modal form.


The remainder of the problem is numerical, using 


eS 27 0 
A=|8 -10 —4 B= |0 C=[100] 
-4 5 2 1 


(b) Find the eigenvalues and eigenvectors of $A$, and verify that the matrix $\Lambda = V^{-1}AV$ is indeed
diagonal when $V$ is the matrix of eigenvectors.


(c) Obtain a state space model in modal form. 


2.12 (Foster’s Criterion) Suppose that $\frac{d}{dt}x = f(x)$ is a nonlinear state space model on $\mathbb{R}^n$. Assume
also that there is a $C^1$ function $V\colon \mathbb{R}^n \to \mathbb{R}_+$, and a set $S$ such that,

\[ \langle \nabla V(x), f(x) \rangle \le -1, \qquad x \in S^c \tag{2.71} \]

Foster introduced a version of this stability criterion for Markov chains in the middle of the last
century [135].
(a) Show that $T_S(x) \le V(x)$ for $x \in \mathbb{R}^n$, where

\[ T_S(x) = \min\{t \ge 0 : x_t \in S\}, \qquad x_0 = x \in \mathbb{R}^n. \]

(b) In the special case of a stable linear system [$f(x) = Ax$, with $A$ Hurwitz], show that a solution
to (2.71) is given by $V(x) = \log(1 + x^T M x)$ for some matrix $M > 0$, and with $S = \{x : \|x\| \le k\}$
for some scalar $k$.


(c) Find an explicit V, S for A = CS ‘i 


0 4 (the matrix used in Exercise 2.6). 
2.13 Consider the nonlinear state space model on the real line,

\[ \tfrac{d}{dt}x = f(x) = \frac{1 - e^{x}}{1 + e^{x}} = -\tanh(x/2) \]

(a) Sketch $f$ as a function of $x$, and from this plot explain why $x^\circ = 0$ is an equilibrium, and this
equilibrium is globally asymptotically stable.
(b) Find a solution to the Poisson inequality (2.39): $\langle \nabla V, f \rangle \le -c + \bar\eta$, with $c(x) = x^2$ and $\bar\eta < \infty$.
You might try a polynomial, or a log of a polynomial of $|x|$. See if you can find a solution with
$\bar\eta = 0$.
(c) Find a solution $V$ to Foster’s criterion (2.71), with $S = [-k, k]$ for some $k > 0$. Also, explain
why $T_S(x)$ is not finite valued using $S = \{0\}$ (that is, $k = 0$).

2.14 Suppose that one wants to minimize a $C^1$ function $V\colon \mathbb{R}^n \to \mathbb{R}_+$. A necessary condition for a
point $x^\circ \in \mathbb{R}^n$ to be a minimum is that it be a stationary point: $\nabla V(x^\circ) = 0$.


Consider the steepest descent algorithm $\frac{d}{dt}x = -\nabla V(x)$. Find conditions on the function $V$ to
ensure that a given stationary point $x^\circ$ will be asymptotically stable for this equation. One approach:
find conditions under which the function V is a Lyapunov function for this state space model. We 
will return to this topic in Section 4.4. 
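To build intuition before attempting the general question, the steepest descent ODE can be simulated for a simple objective. The sketch below assumes NumPy and a hypothetical quadratic $V$ (the matrix `Q`, the step size and the initial condition are illustrative choices, not from the text), and uses an Euler approximation of the ODE.

```python
import numpy as np

# Hypothetical objective: V(x) = 0.5 * x^T Q x with Q positive definite,
# so the unique stationary point (and minimizer) is the origin.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def V(x):
    return 0.5 * x @ Q @ x

def grad_V(x):
    return Q @ x

# Euler approximation of dx/dt = -grad V(x).
dt = 0.01
x = np.array([4.0, -3.0])
values = [V(x)]
for _ in range(2000):
    x = x - dt * grad_V(x)
    values.append(V(x))

# V is non-increasing along the trajectory and x approaches the origin.
print(values[0], values[-1], np.linalg.norm(x))
```

In this example the recorded values of $V$ never increase, which is the behavior one would look for when testing whether $V$ can serve as a Lyapunov function.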
2.15 Consider the nonlinear state space model on the real line,

\[ \tfrac{d}{dt}x = f(x) = -x^3 \]

(a) Sketch $f$ as a function of $x$, and from this plot explain why $x^\circ = 0$ is an equilibrium, and this
equilibrium is globally asymptotically stable.

(b) Find a solution to the Poisson inequality (2.39) with $c(x) = x^2$: $\langle \nabla V, f \rangle \le -c + \bar\eta$ with $\bar\eta < \infty$.
You might try a polynomial, or a log of a polynomial of $|x|$. See if you can find a solution with
$\bar\eta = 0$.

(c) Find a bounded solution to Foster’s criterion (2.71). 


2.16 Consider the Van der Pol oscillator, described by the pair of equations

\[ \tfrac{d}{dt}x_1 = x_2, \qquad \tfrac{d}{dt}x_2 = -(1 - x_1^2)x_2 - x_1. \tag{2.72} \]

(a) Obtain a linear approximate model $\frac{d}{dt}x = Ax$ around the unique equilibrium $x^\circ = 0$.
(b) Verify that $A$ is Hurwitz, and obtain a quadratic Lyapunov function $V$ for the linear model.


(c) Show that $V$ is also a Lyapunov function for (2.72) on the set $S_V(r)$ defined in (2.28), for
some $r > 0$. That is, show that the drift inequality (2.37) holds whenever $x_t \in S_V(r)$.


Conclude that the set $S_V(r) \subset \Omega$ = the region of attraction for $x^\circ$.


(d) Can we find the entire region of attraction? Take a box around the origin $B = \{x : -m \le
x_1 \le m,\ -m \le x_2 \le m\}$ for some integer $m$ (definitely larger than 1, but less than 10 will suffice).
Choose $N$ values $\{x^i\} \subset B$ (say, $N = 10^3$), and simulate the ODE for each $i$, with $x_0 = x^i$, to test
to see if $x_t \in S_V(r)$ for some $t < \infty$, and hence $x^i \in \Omega$.


Why does entry to $S_V(r)$ guarantee that $x_0$ is in the region of asymptotic stability?


2.17 Inverted pendulum with friction. Consider the pendulum with applied force $u$, and “damping
force” $b\dot{\theta}$:
1 
an — fi. = me 
vA qr ht) = gsin(xz ) — ucos(2x1) 
u 


where $x = (\theta, \dot{\theta})^T$, and $a, b > 0$. Note that the location of $\theta = 0$ is now at the top, in contrast to
what is shown in Fig. 2.8. This is because our goal here is to swing the pendulum up and stabilize
in the unstable upward position (corresponding to $\theta = 0$ in this exercise).

Envision the state space as an infinite tube: equate $\theta$ and $\theta + 2\pi n$ for any $n$.

(a) Obtain a linearized state space model with equilibrium $(x^\circ, u^\circ)$ for each possible equilibrium
(you will find that $x_2^\circ = 0$ is required). Comment on the challenge for $x_1^\circ = \pm\pi/2$.


(b) Obtain a linear feedback law that results in $x^\circ = 0$ asymptotically stable (locally).


You will obtain a control solution that is globally asymptotically stable in Exercise 3.10 after you 
learn a few concepts from optimal control in the next chapter. 


2.18 Linear control design for MagBall. Our goal is to maintain the ball at rest at some pre-assigned 
distance r from the magnet. 
(a) Find $u^\circ$ so that $x^\circ = (r, 0)^T$ is an equilibrium: $f(x^\circ, u^\circ) = 0$. Based on the linearization (2.62),
design a linear control law for (2.62), of the form

\[ u = -K\tilde{x} = -K_1\tilde{x}_1 - K_2\tilde{x}_2 \]

with $\tilde{x}_1 = x_1 - x_1^\circ$ and $\tilde{x}_2 = x_2$. Make sure that your solution results in $A - BK$ Hurwitz.


(b) A difficulty with this design is that $u^\circ$ depends on $\kappa/m$, which may not be known. Modify
your design as follows:

\[ u = -K\tilde{x}^a, \quad \text{where} \quad \tilde{x}^a = (\tilde{x}_1, \tilde{x}_2, z^I)^T, \quad z^I_t = \int_0^t \tilde{x}_1(\tau)\, d\tau \tag{2.73} \]

with $K = [K_1, K_2, K_3]$. This is known as proportional-integral-derivative (PID) control. Obtain a
third-order linear state space model, and choose $K_3 > 0$ so that the $3 \times 3$ matrix remains Hurwitz,
and the transient behavior remains “good” [you decide what that means].


Observe that the equilibrium condition $\frac{d}{dt}z^I = 0$ implies that $x_1^\circ = r$.
(c) Simulate as in Exercise 2.16 to estimate the region of attraction (you may restrict to initial
conditions with zero velocity).


2.19 Feedback linearization for MagBall. For systems with simple nonlinearities, there is a “brute-force”
approach to obtain a linear model. For MagBall we may view $v = u^2/x_1^2$ as an input, from
which we obtain a linear system via (2.60):

\[ \tfrac{d}{dt}x_1 = x_2, \qquad \tfrac{d}{dt}x_2 = g - \frac{\kappa}{m}v \]

(a) As in the previous exercise, obtain a control law $v = -K\tilde{x}$, where $K_1$ and $K_2$ are parameters
chosen for stability and good transient response.

(b) Obtain an expression for the equilibrium $x^\circ$ for the closed loop system using the gain $K$
obtained in (a). This is obtained by setting $\frac{d}{dt}x_i = 0$ for $i = 1, 2$.

(c) Modify your design as in (2.73): $v = -K\tilde{x}^a$. Find $K_3 > 0$ so that the transient behavior
remains “good”.

(d) You will need to modify your policy in (c) so $v$ is non-negative valued, say $v = \phi(\tilde{x}^a) =
\max(0, K\tilde{x} + K_3 z^I)$. The current applied to the magnet using this policy is then

\[ u = x_1\sqrt{\phi(\tilde{x}^a)} \tag{2.74} \]

Simulate, and estimate the region of attraction. You may restrict to zero initial velocity.


How does the region of attraction change when $\kappa$ is doubled? $\kappa$ divided by 2? Do not change your
policy! The point is to check if your solution is robust to an inaccurate model.


Warning: recall that it is not possible to achieve convergence to x° from any initial condition. 


Note: See [199] for a survey on feedback linearization — a topic that has far more depth than is 
obvious from this example. 
Matrix algebra:


2.20 Let $A$ be an $n \times n$ matrix, and suppose that the infinite sum exists

\[ U = I + A + A^2 + A^3 + \cdots \]

where $I$ denotes the identity matrix. Verify that $U$ is the inverse of the matrix $I - A$.
Note that this coincides with the Taylor series expansion of $f(x) = 1/(1 - x)$ when $n = 1$.
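A numerical illustration can make the statement concrete before attempting the proof. The sketch below assumes NumPy and a hypothetical $2 \times 2$ matrix whose eigenvalues lie inside the unit disk, so that the partial sums converge; it is a sanity check, not a substitute for the verification requested in the exercise.

```python
import numpy as np

# Hypothetical 2 x 2 example with spectral radius below one, so the sum converges.
A = np.array([[0.5, 0.2],
              [0.1, 0.3]])

U = np.zeros_like(A)
term = np.eye(2)            # current power A^m, starting from A^0 = I
for _ in range(200):
    U += term
    term = term @ A

print(np.allclose(U, np.linalg.inv(np.eye(2) - A)))
```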


2.21 Two square matrices $A$ and $\bar{A}$ are called similar if there is an invertible matrix $M$ such that

\[ \bar{A} = M^{-1}AM \]

Obtain the following for two similar matrices $A$ and $\bar{A}$.


(a) Show that $A^m$ is similar to $\bar{A}^m$ for any $m \ge 1$, where the superscript “$m$” denotes matrix
product,

\[ A^1 = A, \qquad A^m = A(A^{m-1}), \quad m > 1. \]

(b) Show that $v$ is an eigenvector for $\bar{A}$ if and only if $Mv$ is an eigenvector for $A$.


(c) Suppose that $\bar{A}$ is diagonal ($\bar{A}_{ij} = 0$ if $i \neq j$). Suppose moreover that $|\bar{A}_{ii}| < 1$ for each $i$.
Conclude that $I - A$ admits an inverse by applying Exercise 2.20.


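2.22 Matrix exponential. Compute $e^{At}$ for all $t$ for the $2 \times 2$ matrix

\[ A = \alpha I + \beta J, \qquad I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad J = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \]

The notation is intended to be suggestive: $J^2 = -I$.


It is not difficult to obtain a formula for $A^m$ for each $m$, as required in the definition (2.45). With
$\alpha < 0$ and $\beta \neq 0$, describe the solution to $\frac{d}{dt}x = Ax$ with non-zero initial condition.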


2.9 Notes 


The notion of “state” is flexible in both control theory [15] and reinforcement learning [338, 337]. 
The motivation is the same in each field: for the purposes of on-line decision making, replace the 
full history of observations at time k by some finite dimensional “sufficient statistic” x(k). One 
constraint that arises in RL is that the state process must be directly observable; in particular, the 
belief state that arises in partially observed MDPs (Markov decision processes) requires the (model 
based) nonlinear filter, and is hence not directly useful for model-free RL. In practice, the “RL 
state” is specified as some compression of the full history of observations — see [338, Section 17.3] 
for further discussion. 

For more on linear models see [81, 7], and [118] for more advanced and recent material. 

Textbook treatments on Lyapunov theory can be found in [45] (nonlinear) and [7, 205] (linear). 
The ECE Department at the University of Illinois had a great course on state space methods—the 
lecture notes are now available online [29]. The first section of [165] contains a brief crash-course on 
Lyapunov theory, written in the style of this book, and with applications to reinforcement learning. 

Poisson’s inequality (2.31) is far removed (roughly two centuries) from the celebrated equation 
introduced by mathematician Siméon Poisson. The motivation back then was potential theory, as 
defined in theoretical physics. About one century later, Poisson’s equation arose as a central player
in studying the evolution of the density of Brownian motion (a particular Markov process). The
terminology Poisson inequality and Poisson equation is today applied to any Markov chain, with
generator playing the role of the Laplacian. The generator takes any function $h\colon X \to \mathbb{R}$ to a new
function denoted $\mathcal{A}h$. In particular, the deterministic state space model (2.21) can be regarded as
a Markov chain [257], and the associated generator is defined as

\[ \mathcal{A}h\,(x) = h(F(x)) - h(x) \]

In this notation, (2.31) becomes $\mathcal{A}V \le -c + \bar\eta$.
Chapter 3


Optimal Control 


The chapter surveys optimization techniques for design of the feedback loop in Fig. 2.1. 

The feedforward component of the input is often designed based on optimization techniques, 
but without the dynamic programming equations that are the focus of this chapter and much of RL 
theory. Keep the larger feedback loops in mind when you apply RL to real world control problems. 


3.1 Value Function for Total Cost 


To begin, we recall some notation: $x(k)$ is the state at time $k$, which evolves in a state space $X$;
$u(k)$ is the input at time $k$, which evolves in the input (or action) space $U$ (the sets $X$ and $U$ may be
Euclidean space, a finite set, or something more exotic). There may also be an output $y$ as shown
in Fig. 2.1, but it is usually ignored in this chapter. The input and state are related through the
dynamical system (2.6a):

$$x(k+1) = F(x(k), u(k)) \tag{3.1}$$

where $F\colon X \times U \to X$.
Design of a state feedback policy $u(k) = \phi(x(k))$ is based on a cost function $c$. It is assumed
throughout that it takes on non-negative values:

$$c\colon X \times U \to \mathbb{R}_+$$


The total cost $J$ associated with a particular control input $u := u_{[0,\infty)}$ is defined by the sum

$$J(u) = \sum_{k=0}^{\infty} c(x(k), u(k))$$

The value function is defined to be the minimum over all inputs, which is a function of the initial
condition:

$$J^\star(x) = \min_{u} \sum_{k=0}^{\infty} c(x(k), u(k)), \qquad x(0) = x \in X. \tag{3.2}$$


The goal of optimal control is to find an input sequence that achieves the minimum in (3.2). We 
settle for an approximation in the majority of cases. 


Why should we care? It is rare in our everyday lives that we think about solving a decision
problem over an infinite horizon. The infinite-horizon formulation is favored in the control theory
literature because an optimal policy often comes with stability guarantees: Thm. 3.1 implies this
identity for the optimal input-output process:

$$J^\star(x^\star(k)) = c(x^\star(k), u^\star(k)) + J^\star(x^\star(k+1))$$

which is a version of Poisson's inequality (2.31) with $\bar\eta = 0$ (and inequality replaced by equality).
Mild conditions laid out in Prop. 2.3 then imply that $x^e$ is globally asymptotically stable under the
optimal policy.

What’s more, once you understand the total cost formulation, other standard optimal control 
objectives can be treated as special cases. This is explained in Section 3.3. 


Under our assumption that c is non-negative, the value function is also non-negative. Below 
are minimal assumptions to ensure that J* is finite: 


(i) There is a target state $x^e$ that is an equilibrium for some input $u^e$:
$$F(x^e, u^e) = x^e$$
(ii) The cost function $c$ is non-negative, and vanishes at this equilibrium, $c(x^e, u^e) = 0$.
(iii) For any initial condition $x_0$, there is an input sequence $u^0$ and a time $T^0$ such that with this
initial condition and this input we have $x(T^0) = x^e$.


Condition (iii) is a weak form of controllability. Under these three assumptions it follows that
$J^\star(x) < \infty$ for each $x$.
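
Here is a sketch of why the three assumptions give a finite value function, using only the notation above (the extended input is spelled out here as an illustration): apply the input from (iii) until time $T^0$, then hold $u^e$ forever. By (i) the state stays at $x^e$, and by (ii) no further cost accrues, so

```latex
J^\star(x_0) \;\le\; \sum_{k=0}^{T^0-1} c(x(k),u(k)) \;+\; \sum_{k=T^0}^{\infty} c(x^e,u^e)
          \;=\; \sum_{k=0}^{T^0-1} c(x(k),u(k)) \;<\; \infty
```

The final bound holds because the remaining sum has finitely many terms, assuming that $c$ is finite valued.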


Example 3.1.1. Linear Quadratic Regulator 


The Linear Quadratic Regulator problem refers to the special case of linear dynamics (2.13), with
quadratic cost:

$$c(x,u) = x^\top S x + u^\top R u \tag{3.3}$$

It is always assumed that $S \ge 0$ (positive semidefinite) and $R > 0$ (positive definite). If there is one
policy for which the total cost is finite valued, then the value function is quadratic: $J^\star(x) = x^\top M^\star x$ where
$M^\star \ge 0$. The optimal policy is obtained by linear state feedback: $\phi^\star(x) = -K^\star x$ for a matrix $K^\star$
that is a function of $M^\star$ and other system parameters. A bit more on this special case is contained
in Section 3.6, where it will be clear why we impose $R > 0$.


3.2 Bellman Equation 


Let $x = x(0)$ be an arbitrary initial state, and let $k_m$ be an intermediate time, $0 < k_m < \infty$. We
regard $J^\star(x(k_m))$ as the cost to go at time $k_m$: this is the optimal total cost over the remaining
lifetime of the optimal state-input trajectory.

Based on this interpretation we obtain

$$
\begin{aligned}
J^\star(x) &= \min_{u_{[0,\infty)}} \Big[ \sum_{k=0}^{k_m-1} c(x(k),u(k)) + \sum_{k=k_m}^{\infty} c(x(k),u(k)) \Big]
\\
&= \min_{u_{[0,k_m-1]}} \Big[ \sum_{k=0}^{k_m-1} c(x(k),u(k)) + \underbrace{\min_{u_{[k_m,\infty)}} \sum_{k=k_m}^{\infty} c(x(k),u(k))}_{J^\star(x(k_m))} \Big]
\end{aligned}
$$

which gives the functional "fixed point equation":

$$J^\star(x) = \min_{u_{[0,k_m-1]}} \Big\{ \sum_{k=0}^{k_m-1} c(x(k),u(k)) + J^\star(x(k_m)) \Big\} \tag{3.4a}$$

As a consequence, the optimal control over the whole interval has the property illustrated in Fig. 3.1:
If the optimal trajectory passes through the state $x_m$ at time $k_m$ using the control $u^\star = u^\star_{[0,\infty)}$,
then the control $u^\star_{[k_m,\infty)}$ must be optimal for the system starting at $x_m$ at time $k_m$. If a better $u^\star$
existed on $[k_m,\infty)$, we would have chosen it. This concept is called the principle of optimality.

[Figure 3.1: Principle of optimality: if a better control existed on $[k_m,\infty)$, we would have chosen it. Legend: optimal trajectory starting from $x$ at time $t = 0$; optimal trajectory starting from $x_m$ at time $k_m$.]

Analysis in continuous time proceeds by letting $k_m \downarrow 0$ to obtain a partial differential equation.
Theory is far simpler in discrete time: we set $k_m = 1$ to obtain the following celebrated result.


Theorem 3.1. Suppose that the value function $J^\star$ is finite valued, and an optimal input $u^\star$
solving (3.2) exists. Then, the value function satisfies

$$J^\star(x) = \min_{u} \{ c(x,u) + J^\star(F(x,u)) \} \tag{3.5}$$

Suppose that the minimizer in (3.5) is unique for each $x$, and let $\phi^\star(x)$ denote this minimizer. Then,
the optimal input is expressed as state feedback,

$$u^\star(k) = \phi^\star(x^\star(k)) \tag{3.6}$$
□


Equation (3.5) is often interpreted as a fixed point equation in the unknown “variable” J*. It 
goes by the name Bellman equation or dynamic programming (DP) equation: the two terms are 
used interchangeably in this book. 


Q-function. The function of two variables within the minimum in (3.5) is the Q-function of
reinforcement learning:

$$Q^\star(x,u) = c(x,u) + J^\star(F(x,u)) \tag{3.7a}$$

The Bellman equation is expressed

$$J^\star(x) = \min_{u} Q^\star(x,u) \tag{3.7b}$$

and the optimal feedback law is any minimizer:

$$\phi^\star(x) \in \arg\min_{u} Q^\star(x,u), \qquad x \in X \tag{3.7c}$$

The Q-function solves the fixed point equation,

$$Q^\star(x,u) = c(x,u) + \underline{Q}^\star(F(x,u)), \qquad x \in X,\ u \in U \tag{3.7d}$$

where $\underline{Q}(x) := \min_{u} Q(x,u)$, $x \in X$, for any function $Q$.


Eqn. (3.7d) is obtained by eliminating J* in (3.7a) via the identity (3.7b). Applications of this 
dynamic programming equation appear throughout the book, starting in Section 3.7. 
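
The relations (3.7a)-(3.7c) can be sketched in a few lines of Python. The five-state model below (dynamics `F`, cost `c`, and the hand-computed `J_star`) is a hypothetical toy example introduced only for illustration; it is not taken from the text.

```python
# Hypothetical toy model: states 0..4, inputs {0, 1}, equilibrium (x^e, u^e) = (0, 0).
STATES = range(5)
INPUTS = (0, 1)

def F(x, u):
    """Dynamics: move 1 + u steps toward the equilibrium state 0."""
    return max(x - 1 - u, 0)

def c(x, u):
    """Non-negative cost that vanishes only at (x, u) = (0, 0)."""
    return x * x + 2 * u

# Value function of the toy model, worked out by hand from (3.5).
J_star = [0, 1, 5, 12, 23]

# Q-function (3.7a), stored as a dict keyed by (x, u).
Q_star = {(x, u): c(x, u) + J_star[F(x, u)] for x in STATES for u in INPUTS}

# Bellman equation (3.7b): J*(x) = min_u Q*(x, u).
J_from_Q = [min(Q_star[x, u] for u in INPUTS) for x in STATES]

# Optimal feedback law (3.7c): a minimizer of Q*(x, .) for each x.
phi_star = [min(INPUTS, key=lambda u, x=x: Q_star[x, u]) for x in STATES]
```

In this toy model the larger input is worth its extra cost only from the two states farthest from the equilibrium.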

The term dynamic programming also refers to recursive algorithms designed to obtain the 
solution to a Bellman equation. However, dynamic programming is only practical when X is finite, 
or the system has special structure (such as for linear state space models and quadratic cost). 
The two most popular algorithms are value iteration and policy iteration (also known as policy 
improvement). 


3.2.1 Value iteration 


Given an initial approximation $V^0$ for $J^\star$ appearing in (3.5), a sequence of approximations is defined
recursively via

$$V^{n+1}(x) = \min_{u} \{ c(x,u) + V^n(F(x,u)) \}, \qquad x \in X,\ n \ge 0 \tag{3.8}$$

Recursions like this to solve fixed point equations are generally known as successive approximation.
In Exercise 3.5 you will establish the following interpretation:

$$V^{n+1}(x) = \min_{u_{[0,n]}} \Big\{ \sum_{k=0}^{n} c(x(k),u(k)) + V^0(x(n+1)) \Big\}, \qquad x(0) = x \in X. \tag{3.9}$$

The value iteration algorithm (VIA) is convergent under very general conditions: for each $x$,

$$\lim_{n\to\infty} [V^n(x) - V^n(x^e)] = J^\star(x)$$

Here is the simplest result of this kind:


Proposition 3.2. Consider the VIA under the following assumptions:
(i) The state space $X$ and input space $U$ are finite.
(ii) The cost function $c$ is non-negative, vanishes only at $(x^e,u^e)$, and $J^\star$ is finite valued.
(iii) The initialization $V^0$ is chosen with non-negative entries, and $V^0(x^e) = 0$.


Then, there is $n_0 \ge 1$ such that

$$V^n(x) = J^\star(x), \qquad x \in X,\ n \ge n_0$$


Proof. Let $\phi^\star$ be an optimal policy, and let $n_0 \ge 1$ denote a value such that $(x^\star(k), u^\star(k)) = (x^e, u^e)$
for $k \ge n_0$. Such an integer exists because $J^\star$ is finite valued.
We have from (3.9)

$$V^n(x) \le \sum_{k=0}^{n-1} c(x(k),u(k)) + V^0(x(n)), \qquad \text{when } u(k) = \phi^\star(x(k)) \text{ for each } k,\ x(0) = x \in X.$$

The right hand side is precisely $J^\star(x) + V^0(x^e) = J^\star(x)$ for $n \ge n_0$. And for such $n$, the inequality
must then be equality due to (3.9) and optimality of $\phi^\star$. □
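
A minimal sketch of the recursion (3.8), and of the finite-time convergence asserted in Prop. 3.2, follows. The five-state model (`F`, `c`) is a hypothetical toy example chosen so that the assumptions of the proposition hold; the stopping rule `V_next == V` is an implementation choice made here, not part of the text.

```python
# Hypothetical toy model: states 0..4, inputs {0, 1}, equilibrium (x^e, u^e) = (0, 0).
STATES = range(5)
INPUTS = (0, 1)

def F(x, u):
    return max(x - 1 - u, 0)

def c(x, u):
    return x * x + 2 * u  # vanishes only at (0, 0)

def via_step(V):
    """One step of (3.8): V^{n+1}(x) = min_u { c(x,u) + V^n(F(x,u)) }."""
    return [min(c(x, u) + V[F(x, u)] for u in INPUTS) for x in STATES]

def value_iteration(V0, max_iter=100):
    """Iterate (3.8) until V^{n+1} = V^n; return the fixed point and the index n reached."""
    V = list(V0)
    for n in range(max_iter):
        V_next = via_step(V)
        if V_next == V:
            return V, n
        V = V_next
    raise RuntimeError("no fixed point found within max_iter steps")

# Initialization with non-negative entries and V^0(x^e) = 0, as in Prop. 3.2 (iii).
V_limit, n0 = value_iteration([0, 0, 0, 0, 0])
```

For this toy model the iterates stop changing after three steps, so `n0` plays the role of $n_0$ in the proposition.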


The VIA generates a sequence of policies, defined as minimizers in (3.8). For each $n \ge 0$,

$$\phi^n(x) \in \arg\min_{u} \{ c(x,u) + V^n(F(x,u)) \}, \qquad x \in X \tag{3.10}$$

Each of these policies is stabilizing, subject to an assumption on the initial value function: The
function $V^0$ is non-negative, and satisfies for some $\bar\eta \ge 0$,

$$\min_{u} \{ c(x,u) + V^0(F(x,u)) \} \le V^0(x) + \bar\eta, \qquad x \in X \tag{3.11}$$

This can be interpreted as a version of Poisson's inequality (2.31) for the system controlled using
the policy $\phi^0$:

$$V^0(F(x,\phi^0(x))) \le V^0(x) - c(x,\phi^0(x)) + \bar\eta$$


Proposition 3.3. Suppose that (3.11) holds, with $V^0$ non-negative. That is,

$$\{ c(x,u) + V^0(F(x,u)) \}\big|_{u=\phi^0(x)} \le V^0(x) + \bar\eta, \qquad x \in X$$

Then a similar bound holds for each $n$:

$$\{ c(x,u) + V^n(F(x,u)) \}\big|_{u=\phi^n(x)} \le V^n(x) + \bar\eta_n, \qquad x \in X$$

where the upper bounds are non-negative and non-increasing:

$$\bar\eta \ge \bar\eta_0 \ge \bar\eta_1 \ge \cdots$$

The conclusions of Prop. 3.3 are most interesting when $\bar\eta = 0$ so that, for each $n$,

$$\{ c(x,u) + V^n(F(x,u)) \}\big|_{u=\phi^n(x)} \le V^n(x)$$

The following bound then follows from Prop. 2.3,

$$J^n(x) \le V^n(x), \qquad x \in X,$$

where $J^n$ is the total cost using policy $\phi^n$.
Proof of Prop. 3.3. Denote for $n \ge 0$,

$$B^n(x) = V^{n+1}(x) - V^n(x), \qquad \bar\eta_n = \sup_{x} B^n(x)$$

The VIA recursion (3.8) gives, for any $x$,

$$\min_{u} \{ c(x,u) + V^n(F(x,u)) \} = V^{n+1}(x) = V^n(x) + B^n(x)$$

For this reason, the function $B^n$ is known as the Bellman error (associated with the approximation
of $J^\star$ by $V^n$). The Lyapunov bound then follows from the definition of $\phi^n$:

$$\{ c(x,u) + V^n(F(x,u)) \}\big|_{u=\phi^n(x)} = \min_{u} \{ c(x,u) + V^n(F(x,u)) \} \le V^n(x) + \bar\eta_n$$

It remains to obtain bounds on $\{\bar\eta_n\}$. First observe that by (3.11),

$$V^1(x) \le \{ c(x,u) + V^0(F(x,u)) \}\big|_{u=\phi^0(x)} \le V^0(x) + \bar\eta$$

from which we conclude that $B^0(x) \le \bar\eta$ for all $x$, and hence also $\bar\eta_0 \le \bar\eta$.
The next steps are similar: for $n \ge 1$,

$$
\begin{aligned}
V^{n+1}(x) &\le \{ c(x,u) + V^n(F(x,u)) \}\big|_{u=\phi^{n-1}(x)}
\\
V^n(x) &= \{ c(x,u) + V^{n-1}(F(x,u)) \}\big|_{u=\phi^{n-1}(x)}
\end{aligned}
$$

On subtracting,

$$B^n(x) = V^{n+1}(x) - V^n(x) \le \{ V^n(F(x,u)) - V^{n-1}(F(x,u)) \}\big|_{u=\phi^{n-1}(x)} = B^{n-1}(F(x,\phi^{n-1}(x)))$$

Hence $\bar\eta_n = \sup_x B^n(x) \le \bar\eta_{n-1}$ as claimed. □
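
The monotonicity of the bounds in Prop. 3.3 can be observed numerically. The sketch below tracks the supremum of the Bellman error along the VIA for a hypothetical five-state toy model (`F`, `c` are illustrative choices, not from the text), started from $V^0 = 0$, for which (3.11) holds with $\bar\eta = 16$.

```python
# Hypothetical toy model: states 0..4, inputs {0, 1}.
STATES = range(5)
INPUTS = (0, 1)

def F(x, u):
    return max(x - 1 - u, 0)

def c(x, u):
    return x * x + 2 * u

def via_step(V):
    """One step of the VIA recursion (3.8)."""
    return [min(c(x, u) + V[F(x, u)] for u in INPUTS) for x in STATES]

def bellman_error_sups(V0, n_steps):
    """Return [sup_x B^0(x), sup_x B^1(x), ...] with B^n = V^{n+1} - V^n."""
    V, sups = list(V0), []
    for _ in range(n_steps):
        V_next = via_step(V)
        sups.append(max(b - a for a, b in zip(V, V_next)))
        V = V_next
    return sups

eta_bars = bellman_error_sups([0, 0, 0, 0, 0], 5)
```

The resulting sequence is non-increasing and reaches zero once the VIA has converged.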


3.2.2 Policy Improvement 


The Policy Improvement Algorithm (PIA) starts with an initial policy $\phi^0$, and updates recursively
as follows: For policy $\phi^n$, the associated total cost is computed:

$$J^n(x) = \sum_{k=0}^{\infty} c(x(k),u(k)), \qquad u(k) = \phi^n(x(k)) \text{ for each } k,\ x(0) = x \in X \tag{3.12}$$

This solves the fixed-policy Bellman equation,

$$J^n(x) = \{ c(x,u) + J^n(F(x,u)) \}\big|_{u=\phi^n(x)} \tag{3.13}$$

This is followed by the policy improvement step to obtain the next policy:

$$\phi^{n+1}(x) \in \arg\min_{u} \{ c(x,u) + J^n(F(x,u)) \}, \qquad x \in X \tag{3.14}$$


The proof of the following is similar to the proof of Prop. 3.3. The proof that the value functions 
are non-increasing is an application of Prop. 2.3. 


Proposition 3.4. Suppose that $\phi^0$ is stabilizing, in the sense that $J^0$ is finite valued. Then for
each $n \ge 0$,

$$\{ c(x,u) + J^n(F(x,u)) \}\big|_{u=\phi^{n+1}(x)} \le J^n(x), \qquad x \in X$$

Consequently, the value functions are non-increasing:

$$J^0(x) \ge J^1(x) \ge J^2(x) \ge \cdots$$
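
The steps (3.12)-(3.14) can be sketched as follows. The five-state model is a hypothetical toy example, and policy evaluation is done by a truncated rollout: this is exact here only because every policy visited below reaches state 0 and then applies input 0 at zero cost, an assumption that would need checking in any other model.

```python
# Hypothetical toy model: states 0..4, inputs {0, 1}, equilibrium (0, 0).
STATES = range(5)
INPUTS = (0, 1)

def F(x, u):
    return max(x - 1 - u, 0)

def c(x, u):
    return x * x + 2 * u

def total_cost(policy, x, horizon=20):
    """Total cost (3.12) by rollout; assumes no cost accrues after `horizon` steps."""
    J = 0
    for _ in range(horizon):
        u = policy[x]
        J += c(x, u)
        x = F(x, u)
    return J

def improve(J):
    """Policy improvement step (3.14)."""
    return [min(INPUTS, key=lambda u, x=x: c(x, u) + J[F(x, u)]) for x in STATES]

policy = [0, 0, 0, 0, 0]   # initial policy: stabilizing for this model
history = []               # value functions J^0, J^1, ...
while True:
    J = [total_cost(policy, x) for x in STATES]
    history.append(J)
    new_policy = improve(J)
    if new_policy == policy:
        break
    policy = new_policy
```

The recorded value functions decrease from one iteration to the next, in line with Prop. 3.4, and the loop stops after a single improvement for this toy model.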
3.2.3 Perron-Frobenius theory: a gentle introduction*


One step in the PIA requires a subroutine: how to solve the fixed policy dynamic programming 
equation (3.13)? The purpose here is to present an efficient approach to computing the value 
function when the state space is finite; this is also a prelude to theory for Markov chains, as well 
as spectral graph theory that arises in ML. 
Let's return for a moment to the setting of Section 2.4, where we considered the state space
model without control:

$$x(k+1) = F(x(k)), \qquad k \ge 0$$

and associated value function (2.24), recalled here:

$$J(x) = \sum_{k=0}^{\infty} c(x(k)), \qquad x(0) = x \in X$$

The value function satisfies a version of (3.13), which in equation (2.25) is expressed in the simpler
form

$$J(x) = c(x) + J(F(x)), \qquad x \in X \tag{3.15}$$

Much of Perron-Frobenius theory concerns the solution of fixed-point equations involving matrices.
To apply this theory, we need a matrix. Assume the state space is finite, and to simplify
notation suppose that the state space is a sequence of positive integers: $X = \{1,2,3,\dots,N\}$ for
some $N \ge 1$. Assume that $x^e = N$, which satisfies $F(N) = N$ by the equilibrium property. Assume
also that $c(N) = 0$, and that the value function is finite valued.

Define an $N \times N$ transition matrix $P$ as follows: $P(i,j) = 0$ or $1$ for each $i$ and $j$, and
$P(i,j) = 1$ means that $j = F(i)$. Consequently, the $i$th row of $P$ has exactly one non-zero element.
In particular, $P(N,N) = 1$ characterizes the equilibrium property. With this notation, we have a
new way of thinking about the fixed policy dynamic programming equation:

$$J(i) = c(i) + \sum_{j=1}^{N} P(i,j) J(j), \qquad 1 \le i \le N \tag{3.16}$$

Now, dear reader: please accept a new way of thinking about the notation:

$$J = c + PJ \tag{3.17}$$

That is, we view $J$ as an $N$-dimensional column vector whose $i$th element is $J(i)$, and the definition
of $c$ is analogous. I am pleading with you here, because I know from experience that young graduate
students feel uncomfortable going from (3.15) to (3.17).

At first glance, it seems clear that we can solve this equation by inversion:

$$J = (I - P)^{-1} c$$

The problem however is that $I - P$ is never invertible. To see this, take $v \in \mathbb{R}^N$ with constant
non-zero entries ($v(i) = v(1) \neq 0$ for all $i$), and observe that by construction of $P$,

$$Pv = v$$

That is, $v$ is an eigenvector of $P$ with eigenvalue 1. It follows that $v$ is in the null space of $I - P$.
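
A quick numerical check of this observation, for a hypothetical four-state chain in which each state moves one step toward the equilibrium $N$ (the map `F` is an illustrative choice, not from the text):

```python
N = 4

def F(i):
    """Uncontrolled dynamics on states 1..N; state N is the equilibrium."""
    return min(i + 1, N)

# Transition matrix: P(i, j) = 1 exactly when j = F(i).
P = [[1 if F(i) == j else 0 for j in range(1, N + 1)] for i in range(1, N + 1)]

ones = [1] * N  # a vector with constant non-zero entries

# P v = v, so (I - P) v = 0 and I - P cannot be inverted.
P_ones = [sum(P[i][j] * ones[j] for j in range(N)) for i in range(N)]
residual = [ones[i] - P_ones[i] for i in range(N)]
```

Every row of `P` sums to one, which is all that the argument uses.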


Pre-publication draft -- March 25, 2022 




You might try a pseudo inverse. In Matlab, this is computed using the command J=pinv(I-P)*c.
But if you do this, you may not understand what is going on behind Matlab's curtain. Also, how
do you know if you have obtained the boundary constraint $J(x^e) = 0$?

Here is one ingenious idea behind the theory of Perron and Frobenius: choose two vectors
$s, v \in \mathbb{R}^N$ with non-negative entries, and satisfying,

$$P(i,j) \ge s(i) v(j), \qquad 1 \le i,j \le N \tag{3.18}$$

This is called a minorization condition. Letting $s \otimes v$ denote the "outer product" of these two
vectors, this is equivalently expressed

$$P(i,j) \ge [s \otimes v](i,j), \qquad 1 \le i,j \le N$$

We then play with the fixed point equation:

$$c = [I - P]J = [I - (P - s \otimes v)]J - [s \otimes v]J$$

and note that the final term is just a constant times $s$, represented as a column vector:

$$[s \otimes v]J = \delta s, \qquad \delta = \sum_{j} v(j) J(j)$$

so that

$$c = [I - (P - s \otimes v)]J - \delta s \tag{3.19}$$

Under very mild conditions we can invert the matrix multiplying $J$ in (3.19). The inverse is
known as the fundamental matrix:

$$G = [I - (P - s \otimes v)]^{-1} = \sum_{n=0}^{\infty} (P - s \otimes v)^n \tag{3.20}$$

with $(P - s \otimes v)^k$ the $k$th matrix power for $k \ge 1$, and with $(P - s \otimes v)^0 = I$ (the identity matrix).
Here is where the minorization condition comes in: the matrix $(P - s \otimes v)^k$ has non-negative entries
for each $k \ge 0$, so that the infinite sum is always meaningful. If it is finite valued, then from (3.19)
we obtain

$$J = Gc + \delta G s$$

To find $\delta$ you must apply the boundary condition for $J$:

$$0 = J(N) = \sum_{k} G(N,k) c(k) + \delta \sum_{k} G(N,k) s(k)$$

and then obtain $\delta$ by division.
Alternatively, think harder about your choice of v! Here is a simple consequence of the Perron- 
Frobenius construction: 


Proposition 3.5. Consider the state space model with $X = \{1,2,3,\dots,N\}$. Suppose that $c\colon X \to \mathbb{R}_+$
vanishes at the state $N$, and suppose that the total cost $J$ is finite valued. Define the matrix $G$
using $s = v = e^N$ (the $N$th basis vector in $\mathbb{R}^N$).

Then, $J = Gc$. That is, for each $i \in X$,

$$J(i) = \sum_{j=1}^{N} G(i,j) c(j) = \sum_{n=0}^{\infty} \sum_{j=1}^{N} (P - s \otimes v)^n(i,j)\, c(j)$$

Proof. The minorization condition holds because $P(N,N) = 1 = [s \otimes v](N,N)$, and $[s \otimes v](i,j) = 0$
for all other $i, j$. Applying the boundary constraint $J(N) = 0$ gives

$$\delta = \sum_{j} v(j) J(j) = J(N) = 0$$

To see that $G$ is finite valued we establish an interpretation for each term in the sum: for $j < N$,

$$(P - s \otimes v)^n(i,j) = \mathbb{1}\{x(n) = j\}, \qquad \text{when } x(0) = i$$

For $j = N$ and $n \ge 1$, the left hand side is the indicator that the state first reaches $N$ at time $n$, and
it is zero for any $j$ and $n \ge 1$ when $i = N$. Let $n_0 \ge 1$ denote an integer for which
$x(n) = N$ for $n \ge n_0$ and any initial condition. Then, $G$ can be expressed as a finite sum:

$$G = \sum_{n=0}^{n_0} (P - s \otimes v)^n$$
□
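
Prop. 3.5 is easy to test on a small example. The sketch below builds $G$ as the finite sum from the proof, with $s = v = e^N$, for a hypothetical four-state chain (the map `F` and the cost vector `cost` are illustrative choices), and compares $Gc$ with the total cost obtained by simulating the chain.

```python
N = 4
cost = [3, 2, 1, 0]  # c(1), ..., c(N); vanishes at state N

def F(i):
    """Uncontrolled dynamics on states 1..N; state N is the equilibrium."""
    return min(i + 1, N)

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

P = [[1 if F(i) == j else 0 for j in range(1, N + 1)] for i in range(1, N + 1)]

# A = P - s (x) v with s = v = e^N: only the (N, N) entry of P is removed.
A = [row[:] for row in P]
A[N - 1][N - 1] -= 1

identity = [[1 if i == j else 0 for j in range(N)] for i in range(N)]
G = [row[:] for row in identity]
power = [row[:] for row in identity]
for _ in range(N):  # A^n vanishes once every trajectory has reached N
    power = matmul(power, A)
    G = [[G[i][j] + power[i][j] for j in range(N)] for i in range(N)]

J = [sum(G[i][j] * cost[j] for j in range(N)) for i in range(N)]

def rollout_cost(i):
    """Total cost from state i by direct simulation (N steps suffice here)."""
    total = 0
    for _ in range(N):
        total += cost[i - 1]
        i = F(i)
    return total
```

Both computations give the same vector, and the boundary condition $J(N) = 0$ holds without any division step.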


3.3 Variations 


The total cost problem (3.2) is the standard in the control literature, and opens the door to many 
other possibilities. 


Discounted cost. A more popular objective within the operations research literature is the
discounted-cost problem:

$$J^\star(x) = \min_{u} \sum_{k=0}^{\infty} \gamma^k c(x(k),u(k)), \qquad x(0) = x \in X. \tag{3.21}$$

with $\gamma \in (0,1)$ the discount factor. This solves the discounted-cost optimality equation

$$J^\star(x) = \min_{u} \{ c(x,u) + \gamma J^\star(F(x,u)) \}$$

The Q-function becomes $Q^\star(x,u) := c(x,u) + \gamma J^\star(F(x,u))$, so that $J^\star(x) = \min_u Q^\star(x,u)$.

Shortest path problem. Given a subset $A \subset X$, define

$$\tau_A = \min\{k \ge 1 : x(k) \in A\}$$

The discounted shortest path problem (SPP) is defined to be the minimal discounted cost incurred
before this time:

$$J^\star(x) = \min_{u} \sum_{k=0}^{\tau_A - 1} \gamma^k c(x(k),u(k)), \qquad x(0) = x \tag{3.22}$$

Proposition 3.6. If $J^\star$ is finite valued, then it is the solution to the DP equation

$$J^\star(x) = \min_{u} \big\{ c(x,u) + \gamma \mathbb{1}\{F(x,u) \in A^c\} J^\star(F(x,u)) \big\}, \qquad x \in X \tag{3.23}$$

Proof. As in the total cost problem we begin with

$$J^\star(x) = \min_{u} \Big\{ c(x,u(0)) + \sum_{k=1}^{\tau_A-1} \gamma^k c(x(k),u(k)) \Big\}$$

with the understanding that $\sum_{k=1}^{0} = 0$: the upper limit in the sum is equal to 0 when $\tau_A = 1$
(equivalently, $x(1) \in A$). Consequently,

$$
\begin{aligned}
J^\star(x) &= \min_{u(0)} \Big\{ c(x,u(0)) + \gamma \mathbb{1}\{x(1) \in A^c\} \min_{u_{[1,\infty)}} \sum_{k=1}^{\tau_A-1} \gamma^{k-1} c(x(k),u(k)) \Big\}
\\
&= \min_{u(0)} \big\{ c(x,u(0)) + \gamma \mathbb{1}\{x(1) \in A^c\} J^\star(x(1)) \big\}, \qquad x(1) = F(x,u(0))
\\
&= \min_{u} \big\{ c(x,u) + \gamma \mathbb{1}\{F(x,u) \in A^c\} J^\star(F(x,u)) \big\}
\end{aligned}
$$
□


For the purposes of unifying the control techniques that follow, it is useful to recast (3.22) as
an instance of the discounted-cost problem (3.21). This requires the definition of a new state process $x^A$
with dynamics $F^A$, and a new cost function $c^A$ defined as follows:

(i) Modified state dynamics:
$$F^A(x,u) = \begin{cases} F(x,u) & x \in A^c \\ x & x \in A \end{cases}$$
so that $x^A(k+1) = x^A(k)$ if $x^A(k) \in A$ (called a graveyard set for the control system).

(ii) Modified cost function:
$$c^A(x,u) = \begin{cases} c(x,u) & x \in A^c \\ 0 & x \in A \end{cases}$$

From these definitions it follows that the value function (3.22) can be expressed

$$J^\star(x) = \min_{u} \sum_{k=0}^{\infty} \gamma^k c^A(x^A(k),u(k)), \qquad x \in A^c$$
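
The graveyard construction can be sketched in code. Below, a hypothetical five-state model with unit cost per step is given the target set $A = \{0\}$ and discount factor $\gamma = 1/2$ (all of these are illustrative choices, not from the text). The discounted value iteration is applied to the modified dynamics and cost, and the result is then checked against the DP equation (3.23) on $A^c$.

```python
# Hypothetical toy model: states 0..4, inputs {0, 1}, target set A = {0}.
STATES = range(5)
INPUTS = (0, 1)
A = {0}
GAMMA = 0.5

def F(x, u):
    return max(x - 1 - u, 0)

def c(x, u):
    return 1  # unit cost per time step

def F_A(x, u):
    """Modified dynamics: states in A are frozen (graveyard set)."""
    return x if x in A else F(x, u)

def c_A(x, u):
    """Modified cost: no cost is paid once the state is in A."""
    return 0 if x in A else c(x, u)

# Discounted value iteration on the modified model.
V = [0.0] * 5
for _ in range(50):
    V = [min(c_A(x, u) + GAMMA * V[F_A(x, u)] for u in INPUTS) for x in STATES]

# Right hand side of (3.23), evaluated for states outside A.
rhs = {x: min(c(x, u) + GAMMA * (F(x, u) not in A) * V[F(x, u)] for u in INPUTS)
       for x in STATES if x not in A}
```

States one or two steps from the target at full speed have value $1$ and $1 + \gamma$ respectively, which is what the recursion returns.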


Example 3.3.1. Mountain Car 


Recall the Mountain Car example introduced in Section 2.7.2. The control objective is a shortest
path problem: to reach the goal in minimal time. Let $c(x,u) = 1$ for all $x, u$ with $x \neq z^{\text{goal}}$, and
$c(z^{\text{goal}},u) = 0$ for any $u$. The SPP can be expressed as a total-cost optimal control problem based
on the model and this cost function. The optimal total cost (3.2) is finite for each initial condition,
and the Bellman equation (3.5) becomes

$$J^\star(x) = 1 + \min_{u} J^\star(F(x,u)), \qquad x_1 < z^{\text{goal}}$$

and with $J^\star(z^{\text{goal}}, x_2) = 0$ for any value of $x_2$.

Fig. 3.2 shows the value function obtained using value iteration, based on a finite state space
approximate model (details may be found in Section 3.9.1). The total cost is relatively low with
initial condition $x_0 = (z,v)$ satisfying $v > 0.2$, because the car can reach the goal without stalling,
using $u(k) = 1$ whenever $z(k) < z^{\text{goal}}$.

[Figure 3.2: $J^\star$ for Mountain Car. Recovered axis label: velocity $v$.]

Finite horizon. Your choice of discount factor is based on how concerned you are with the distant
future. Motivation is similar for the finite horizon formulation: fix a horizon $N \ge 1$, and denote

$$J^\star(x) = \min_{u_{[0,N]}} \sum_{k=0}^{N} c(x(k),u(k)), \qquad x(0) = x \in X. \tag{3.24}$$

This can be interpreted as the total cost problem (3.2), following two modifications of the state
description and the cost function, similar to the SPP:


(i) Enlarge the state process to $x^\bullet(k) = (x(k), \tau(k))$, where the second component is "time"
plus an offset: $\tau(k) = \tau(0) + k$, $k \ge 0$.

(ii) Extend the definition of the cost function as follows:

$$c^\bullet((x,\tau),u) = \begin{cases} c(x,u) & \tau \le N \\ 0 & \tau > N \end{cases}$$

That is, $c^\bullet((x,\tau),u) = c(x,u)\,\mathbb{1}\{\tau \le N\}$ for all $x, \tau, u$.
If these definitions are clear to you, then you understand that we have succeeded in the transformation:

$$J^\star(x) = \min_{u} \sum_{k=0}^{\infty} c^\bullet(x^\bullet(k),u(k)), \qquad x^\bullet(0) = (x,\tau),\ \tau = 0 \tag{3.25}$$

However, to write down the Bellman equation it is necessary to consider all values of $\tau$ (at least
values $\tau \le N$), and not just the desired value $\tau = 0$. Letting $J^\star(x,\tau)$ denote the right hand side of
(3.25) for arbitrary values of $\tau \ge 0$, the Bellman equation (3.5) becomes

$$J^\star(x,\tau) = \min_{u} \{ c(x,u)\,\mathbb{1}\{\tau \le N\} + J^\star(F(x,u), \tau+1) \} \tag{3.26}$$

The similarity with (3.8) is explored in Exercise 3.5.

Based on (3.25) and the definition of the extended cost function, we know that $J^\star(x,\tau) = 0$ for $\tau > N$. This is
considered a boundary condition for the recursion (3.26), which is put to work as follows: first,
since $J^\star(x, N+1) = 0$,

$$J^\star(x,N) = \underline{c}(x) := \min_{u} c(x,u)$$

Applying (3.26) once more gives,

$$J^\star(x,N-1) = \min_{u} \{ c(x,u) + \underline{c}(F(x,u)) \}$$

These steps can be repeated until we obtain the finite-horizon value function $J^\star(\,\cdot\,) = J^\star(\,\cdot\,,0)$.
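
The backward recursion can be sketched as follows, for a hypothetical five-state toy model (`F`, `c` are illustrative choices) with horizon $N = 1$. The function `greedy_inputs` is a helper introduced here to read off a minimizing input at each pair $(x,\tau)$.

```python
# Hypothetical toy model: states 0..4, inputs {0, 1}.
STATES = range(5)
INPUTS = (0, 1)

def F(x, u):
    return max(x - 1 - u, 0)

def c(x, u):
    return x * x + 2 * u

def finite_horizon(N):
    """Return J with J[tau][x] = J*(x, tau) for tau = 0, ..., N + 1, via (3.26)."""
    J = [[0] * len(STATES) for _ in range(N + 2)]  # J[N + 1] = 0: boundary condition
    for tau in range(N, -1, -1):
        J[tau] = [min(c(x, u) + J[tau + 1][F(x, u)] for u in INPUTS) for x in STATES]
    return J

def greedy_inputs(J, tau):
    """A minimizing input in (3.26) for each state, at time tau <= N."""
    return [min(INPUTS, key=lambda u, x=x: c(x, u) + J[tau + 1][F(x, u)]) for x in STATES]

J_table = finite_horizon(N=1)
```

In this example the minimizing input at states 3 and 4 changes between $\tau = 0$ and $\tau = 1$, so the decision depends on time as well as on the state.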


What about the policy? It is again obtained via (3.26), but the optimal policy depends on the
extended state:

$$\phi^\star(x,\tau) \in \arg\min_{u} \{ c(x,u) + J^\star(F(x,u), \tau+1) \}, \qquad \tau \le N \tag{3.27}$$

This means that the feedback is no longer time-homogeneous:

$$u^\star(k) = \phi^\star(x^\star(k), k) \tag{3.28}$$

(Footnote: substituting $\tau(k) = k$ is justified because we cannot control time!)

Model predictive control. An extremely successful control technique in many fields (such as in
manufacturing and building operations) is model predictive control (MPC). This is a slight variant
of (3.28) to obtain a stationary policy:

$$u(k) = \phi^{\text{MPC}}(x(k)) = \phi^\star(x(k), 0) \tag{3.29}$$

where the right hand side is defined in (3.27) using $\tau = 0$. However, MPC is never presented as
state feedback because the policy $\phi^{\text{MPC}}$ is not computed and stored in memory. Rather, for each
$k$, when the state $x = x(k)$ is observed, the finite horizon optimization is performed to obtain
the value $u(k) = \phi^\star(x,0)$. That is, the policy is only evaluated for those states that are observed
[243, 244].

Nevertheless, the general optimal control theory surveyed here provides techniques to ensure
that the total cost associated with (3.29) is finite. Denote for $x \in X$,

$$J^{\text{MPC}}(x) = \sum_{k=0}^{\infty} c(x(k),u(k))$$

subject to $x(0) = x$, and $u(k) = \phi^{\text{MPC}}(x(k))$ for all $k$.
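
The receding-horizon idea can be sketched as follows. At each observed state the finite horizon problem (3.24) is solved by brute-force enumeration of input sequences, and only the first input is applied. The toy model and the enumeration strategy are illustrative assumptions; practical MPC relies on a numerical optimizer rather than enumeration.

```python
from itertools import product

# Hypothetical toy model: states 0..4, inputs {0, 1}.
INPUTS = (0, 1)

def F(x, u):
    return max(x - 1 - u, 0)

def c(x, u):
    return x * x + 2 * u

def mpc_input(x, N):
    """First input of a minimizer of sum_{k=0}^{N} c(x(k), u(k)) with x(0) = x."""
    best_cost, best_u0 = None, None
    for seq in product(INPUTS, repeat=N + 1):
        cost, xk = 0, x
        for u in seq:
            cost += c(xk, u)
            xk = F(xk, u)
        if best_cost is None or cost < best_cost:
            best_cost, best_u0 = cost, seq[0]
    return best_u0

def run_mpc(x0, N, steps):
    """Closed loop under the MPC policy: re-optimize at every observed state."""
    x, total, inputs = x0, 0, []
    for _ in range(steps):
        u = mpc_input(x, N)
        inputs.append(u)
        total += c(x, u)
        x = F(x, u)
    return total, inputs

total, inputs = run_mpc(4, N=1, steps=6)
```

The optimization is only ever carried out at the states visited along the trajectory, which is the point made in the text about the policy never being stored.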


Proposition 3.7. Consider the policy (3.29), obtained with modified objective:

$$J^\star(x;0) = \min_{u_{[0,N-1]}} \sum_{k=0}^{N-1} c(x(k),u(k)) + V^0(x(N)), \qquad x(0) = x \in X, \tag{3.30}$$

where $V^0\colon X \to \mathbb{R}_+$ satisfies (3.11) with $\bar\eta = 0$:

$$\min_{u} \{ c(x,u) + V^0(F(x,u)) \} \le V^0(x), \qquad x \in X$$

Then, the total cost $J^{\text{MPC}}$ is everywhere finite.


Proof. From the definitions (3.9) and (3.30) we conclude that $J^\star(x;0) = V^N(x)$ (the outcome of
$N$ iterations of the VIA). Prop. 3.3 then gives a version of Poisson's inequality

$$\{ c(x,u) + V(F(x,u)) \}\big|_{u=\phi^{\text{MPC}}(x)} \le V(x), \qquad x \in X$$

with $V$ the VIA iterate whose minimizer in (3.10) is $\phi^{\text{MPC}}$. The conclusion that $J^{\text{MPC}}$ is finite
valued follows from Prop. 2.3 (the Comparison Theorem). □


3.4 Inverse Dynamic Programming 


An alternative to dynamic programming is to change the problem: in optimal control we are given
$c$, and then face the (often daunting) task of computing $J^\star$. Here we reverse the computational
task: Given any function $J$, find a cost function $c^J$ so that (3.5) is satisfied.

To proceed in a way that respects our original goal, we consider the Bellman error:

$$B(x) = -J(x) + \min_{u} [ c(x,u) + J(F(x,u)) ] \tag{3.31}$$

This is precisely the error in the Bellman equation (3.5). Based on this we obtain a solution to a
Bellman equation, with modified cost function:

$$J(x) = \min_{u} [ c^J(x,u) + J(F(x,u)) ] \tag{3.32a}$$
$$c^J(x,u) = c(x,u) - B(x) \tag{3.32b}$$

The minimizer in (3.32a) defines a policy, denoted $\phi^J$:

$$\phi^J(x) \in \arg\min_{u} [ c^J(x,u) + J(F(x,u)) ] = \arg\min_{u} [ c(x,u) + J(F(x,u)) ] \tag{3.33}$$
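
The construction (3.31)-(3.32) can be checked numerically. In the sketch below the candidate $J(x) = x^2 + x$ and the five-state model are hypothetical choices made only for illustration.

```python
# Hypothetical toy model: states 0..4, inputs {0, 1}.
STATES = range(5)
INPUTS = (0, 1)

def F(x, u):
    return max(x - 1 - u, 0)

def c(x, u):
    return x * x + 2 * u

# An arbitrary candidate value function, J(x) = x^2 + x.
J = [x * x + x for x in STATES]

# Bellman error (3.31).
B = [-J[x] + min(c(x, u) + J[F(x, u)] for u in INPUTS) for x in STATES]

def c_J(x, u):
    """Modified cost (3.32b)."""
    return c(x, u) - B[x]

# Right hand side of (3.32a): it should reproduce J exactly.
rhs = [min(c_J(x, u) + J[F(x, u)] for u in INPUTS) for x in STATES]
```

Subtracting a function of $x$ alone does not change the minimizer over $u$, which is why the two arg min expressions in (3.33) coincide.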


This procedure is known as Inverse Dynamic Programming (IDP). It is one formulation of the 
control-Lyapunov function approach to control design. Minimizing the Bellman error is a goal of 
many approaches to reinforcement learning. Motivation is provided in the following: 


Proposition 3.8. Suppose that the following hold:
(i) $J$ is non-negative, continuous, and vanishes only at $x^e$.

(ii) The function $c^J(x) = c(x,\phi^J(x))$ satisfies the following: it is non-negative, continuous,
inf-compact, and vanishes only at $x^e$.

(iii) There is a constant $\varrho$ satisfying $0 \le \varrho < 1$, and
$$B(x) = c(x,u) - c^J(x,u) \ge -\varrho\, c(x,u) \quad \text{for all } x, u$$

Let $J^{\text{IDP}}$ denote the value function under the policy $\phi^J$:

$$J^{\text{IDP}}(x) = \sum_{k=0}^{\infty} c(x(k),u(k)), \qquad x(0) = x,\ u(k) = \phi^J(x(k)) \text{ for all } k \tag{3.34}$$

This admits the pair of bounds:

$$J^\star(x) \le J^{\text{IDP}}(x) \le (1+\varrho) J^\star(x)$$


The proof of Prop. 3.8 requires a deeper look at the dynamic programming equation (3.5). The 
following is an extension of Prop. 2.4: 


Proposition 3.9. Suppose that the value function $J^\star$ is finite valued, and the optimal policy $\phi^\star$
is stabilizing, in the sense that $x^\star(k) \to x^e$ as $k \to \infty$ for any initial condition.
Suppose that $J\colon X \to \mathbb{R}_+$ is continuous, vanishes at $x^e$, and solves

$$J(x) = \min_{u} \{ c(x,u) + J(F(x,u)) \}, \qquad x \in X \tag{3.35}$$

Then $J = J^\star$.

Proof. We adopt the same notation as in Inverse Dynamic Programming: $\phi^J(x)$ is defined as a
minimizer in (3.35) for each $x$, and $J^{\text{IDP}}$ the associated value function (3.34). We have the bound
$J^{\text{IDP}} \le J$ by the Comparison Theorem, Prop. 2.3. We can also establish the following bound by
induction on $N$ (starting with (3.35)):

$$
\begin{aligned}
J(x) &\le \min_{u_{[0,N-1]}} \Big\{ \sum_{k=0}^{N-1} c(x(k),u(k)) + J(x(N)) \Big\}
\\
&\le \sum_{k=0}^{N-1} c(x^\star(k),u^\star(k)) + J(x^\star(N)), \qquad x(0) = x,\ u^\star(k) = \phi^\star(x^\star(k)) \text{ for all } k
\end{aligned}
$$

with the second inequality obtained because we have replaced the minimum with a specific policy.
We have $J(x^\star(N)) \to 0$ as $N \to \infty$ by the assumptions on $J$ and $\phi^\star$, and hence $J \le J^\star$.
Putting these bounds together gives for all $x$,

$$J(x) \le J^\star(x) \le J^{\text{IDP}}(x) \le J(x)$$

which implies that these functions coincide, and in particular $J = J^\star$. □


Proof of Prop. 3.8. The assumptions on $J$ and $c^J$ are imposed so that we may apply Prop. 2.3 for
the state space model subject to $u(k) = \phi^J(x(k))$. Part (i) implies that $J^{\text{IDP}}(x) \le J(x)$ for all $x$,
and (iii) tells us that $x^e$ is globally asymptotically stable under this policy. We can then apply
Prop. 3.9 to establish equality:

$$J^{\text{IDP}}(x) \le J(x) = \min_{u} \sum_{k=0}^{\infty} c^J(x(k),u(k))$$

As in the proof of Prop. 3.9, the right hand side can only be increased by replacing the minimum
with the optimal policy: with $x(0) = x$,

$$J^{\text{IDP}}(x) \le \sum_{k=0}^{\infty} c^J(x^\star(k),u^\star(k)) \le (1+\varrho) J^\star(x)$$

where the second inequality uses $c^J \le (1+\varrho)c$. □


3.5 Bellman Equation is a Linear Program 


One approach to control design is to introduce a family of candidate value function approximations
$\{J^\theta : \theta \in \mathbb{R}^d\}$, and compute the parameter $\theta^\star$ that minimizes the Bellman error, such as through
minimizing the mean-square criterion (2.53). A significant challenge is that the resulting loss function
is not convex in $\theta$, even when $J^\theta$ depends linearly on $\theta$.

We obtain a convex optimization problem that is suitable for RL implementation by applying 
a common trick in optimization: over-parameterize the search space. The first step is to regard J* 
and Q* as independent variables, and regard (3.7a) as a linear constraint. Following this approach 
we obtain a linear program that lends itself to RL algorithm design. 

Naturally, we obtain a finite-dimensional linear program only if the state space and action space
are finite. In this case, as part of the algorithm we choose a weighting function $v$, and denote for
any candidate approximation $J$,

$$\langle v, J \rangle = \sum_{x \in X} v(x) J(x)$$

It is assumed that $v(x) \ge 0$ for each $x$. In most cases this will be a probability mass function
(pmf), meaning that in addition we assume $\sum_x v(x) = 1$.

Even when the state space is not finite, say $X = \mathbb{R}^n$, we continue to assume that $v$ has finite
support $\{x^i : 1 \le i \le M\}$. In this case the definition becomes

$$\langle v, J \rangle = \sum_{i} v(x^i) J(x^i)$$

Prop. 3.10 states that the Bellman equation can be cast as a linear program.


Proposition 3.10. Suppose that the value function $J^\star$ defined in (3.2) is continuous, inf-compact,
and vanishes only at $x^e$. Then, the pair $(J^\star,Q^\star)$ solves the following linear program in the "variables"
$(J,Q)$:

$$
\begin{aligned}
\max_{J,Q} \quad & \langle v, J \rangle && \text{(3.36a)}
\\
\text{s.t.} \quad & Q(x,u) \le c(x,u) + J(F(x,u)) && \text{(3.36b)}
\\
& Q(x,u) \ge J(x), \qquad x \in X,\ u \in U(x) && \text{(3.36c)}
\\
& J \text{ is continuous, and } J(x^e) = 0. && \text{(3.36d)}
\end{aligned}
$$

The linear program will be called the dynamic programming linear program (DPLP). It reappears
in Section 5.5 as eq. (5.62), followed by several approaches to approximately solve a DP
equation.

We can without loss of generality strengthen (3.36b) to equality: $Q(x,u) = c(x,u) + J(F(x,u))$.
Based on this substitution, the variable $Q$ is eliminated:

$$
\begin{aligned}
\max_{J} \quad & \langle v, J \rangle && \text{(3.37a)}
\\
\text{s.t.} \quad & c(x,u) + J(F(x,u)) \ge J(x), \qquad x \in X,\ u \in U(x) && \text{(3.37b)}
\end{aligned}
$$

This more closely resembles what you find in the stochastic control literature (see [10] for a survey).


The more complex LP (3.36) is introduced because it is easily adapted to RL applications. 
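
The sketch below is not an LP solver. It only checks the constraint (3.37b) for a hypothetical five-state toy model, to illustrate why maximizing $\langle v, J \rangle$ singles out $J^\star$: the value function is feasible, the zero function is feasible with a smaller objective, and raising any single entry of $J^\star$ violates a constraint. The model and the hand-computed `J_star` are illustrative assumptions.

```python
# Hypothetical toy model: states 0..4, inputs {0, 1}, equilibrium state 0.
STATES = range(5)
INPUTS = (0, 1)

def F(x, u):
    return max(x - 1 - u, 0)

def c(x, u):
    return x * x + 2 * u

def feasible(J):
    """Constraint (3.37b) together with the boundary condition J(x^e) = 0."""
    return J[0] == 0 and all(
        c(x, u) + J[F(x, u)] >= J[x] for x in STATES for u in INPUTS
    )

J_star = [0, 1, 5, 12, 23]  # value function of the toy model, worked out by hand

# Candidates obtained by raising one entry of J_star by one unit.
bumped = [[J_star[y] + (1 if y == x else 0) for y in STATES] for x in STATES if x != 0]

star_is_feasible = feasible(J_star)
zero_is_feasible = feasible([0, 0, 0, 0, 0])
any_bump_feasible = any(feasible(J) for J in bumped)
```

With positive weights $v$, any feasible $J$ lies below $J^\star$ in this example, so the maximum of the objective is attained at $J^\star$.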


3.6 Linear Quadratic Regulator 


For the linear system model (2.13), with quadratic cost (3.3), it is known that the value function is
quadratic, $J^\star(x) = x^\top M^\star x$ for each $x$. The Q-function is also quadratic: combining the definition
(3.7a) with the system model (2.13a),

$$Q^\star(x,u) = c(x,u) + J^\star(Fx + Gu) \tag{3.38}$$

A more explicit quadratic representation can be found in eq. (3.41).
The optimal policy is obtained by minimizing the Q-function over $u$, which is easily done via
the first-order condition for optimality:

$$0 = \nabla_u Q^\star(x,u^\star) = 2Ru^\star + 2G^\top M^\star (Fx + Gu^\star)$$

Under the assumption that $R > 0$ it follows that $R + G^\top M^\star G > 0$ (and hence invertible). The
minimizer $u^\star = \phi^\star(x)$ defines the optimal policy as linear state feedback:

$$\phi^\star(x) = -K^\star x \quad \text{with} \quad K^\star = [R + G^\top M^\star G]^{-1} G^\top M^\star F \tag{3.39}$$

To obtain $\phi^\star$ we must compute the value function, since the gain $K^\star$ depends on $M^\star \ge 0$. This
matrix solves a fixed-point equation known as the algebraic Riccati equation (ARE):

$$M^\star = F^\top \big( M^\star - M^\star G [R + G^\top M^\star G]^{-1} G^\top M^\star \big) F + S \tag{3.40}$$

The derivation of the ARE for the LQR model in continuous time is found in Section 3.9.4.
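
For a scalar system the ARE can be solved by successive approximation, which amounts to value iteration restricted to quadratic functions. The parameters below ($F = G = S = R = 1$) are a hypothetical example; for these values (3.40) reduces to $M^2 - M - 1 = 0$, whose positive root is the golden ratio.

```python
# Hypothetical scalar LQR example: x(k+1) = F x(k) + G u(k), c(x,u) = S x^2 + R u^2.
F, G, S, R = 1.0, 1.0, 1.0, 1.0

def riccati_step(M):
    """Right hand side of the ARE (3.40), specialized to scalars."""
    return F * (M - M * G * (1.0 / (R + G * M * G)) * G * M) * F + S

# Successive approximation, started from M = 0.
M = 0.0
for _ in range(100):
    M = riccati_step(M)

# Optimal gain (3.39): u = -K x.
K = (1.0 / (R + G * M * G)) * G * M * F
```

The closed loop $x(k+1) = (F - GK)x(k)$ has multiplier $1 - K \approx 0.38$ in this example, so the state decays geometrically.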
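To understand the LP (3.36) in this special case, it is most convenient to express all three
functions appearing in (3.36b) in terms of the variable $z^\top = (x^\top, u^\top)$:

$$J^\star(x,u) = z^\top M^{J^\star} z, \qquad Q^\star(x,u) = z^\top M^{Q^\star} z, \qquad c(x,u) = z^\top M^c z \tag{3.41a}$$

$$M^{J^\star} = \begin{bmatrix} M^\star & 0 \\ 0 & 0 \end{bmatrix}, \qquad M^c = \begin{bmatrix} S & 0 \\ 0 & R \end{bmatrix} \tag{3.41b}$$

$$M^{Q^\star} = M^c + \begin{bmatrix} F^\top M^\star F & F^\top M^\star G \\ G^\top M^\star F & G^\top M^\star G \end{bmatrix} \tag{3.41c}$$

Justification of the formula for $M^{Q^\star}$ is contained in the proof of Prop. 3.11 that follows.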


Proposition 3.11. Suppose that $J^\star$ is everywhere finite. Then, the value function and Q-function
are each quadratic: $J^\star(x) = x^\top M^\star x$ for each $x$, where $M^\star \ge 0$ is a solution to the algebraic Riccati
equation (3.40), and the quadratic Q-function is given in (3.41c). The matrix $M^\star$ is also the
solution to the following convex program:

$$M^\star \in \arg\max\ \operatorname{trace}(M) \tag{3.42a}$$

$$\text{s.t.} \quad \begin{bmatrix} S & 0 \\ 0 & R \end{bmatrix} + \begin{bmatrix} F^\top M F & F^\top M G \\ G^\top M F & G^\top M G \end{bmatrix} \ge \begin{bmatrix} M & 0 \\ 0 & 0 \end{bmatrix} \tag{3.42b}$$

where the maximum is over symmetric matrices $M$, and the inequality constraint (3.42b) is in the
sense of symmetric matrices. □


Despite its linear programming origins, (3.42) is not a linear program: it is an example of a 
semidefinite program (SDP) [364]. 


Proof of Prop. 3.11. The reader is referred to standard texts for the derivation of the ARE [7, 77]. 
The following is a worthwhile exercise: postulate that J* is a quadratic function of x, and you will 
find that the Bellman equation implies the ARE. 

Now, on to the derivation of (3.42). The variables in the linear program introduced in Prop. 3.10 
consist of functions J and Q. For the LQR problem we restrict to quadratic functions: 


J(a)=a'Mex, Q(z, u) = 27M@z 


and treat the symmetric matrices (M, M®) as variables. 
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To establish (3.42) we are left to show 1) the objective functions (3.36a) and (3.42a) coincide 
for some v, and 2) the functional constraints (3.36b, 3.36c) are equivalent to the matrix inequality 
(3.42b). The first task is the simplest: 


with {e’} the standard basis elements in R”, and v(e*) = 1 for each i. 

The equivalence of (3.42b) and (3.36b, 3.36c) is established next, and through this we also 
obtain (3.41c). In view of the discussion preceding (3.37), the inequality constraint (3.36b) can be 
strengthened to equality: 

Q* (x, u) = c(a,u) + J*(Fx+ Gu) 


It remains to establish the equivalence of (3.42b) and (3.37). 
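Applying (3.38), we obtain a mapping from $M$ to $M^Q$. Denote

$$M^J = \begin{bmatrix} M & 0 \\ 0 & 0 \end{bmatrix}, \qquad \Xi = \begin{bmatrix} F & G \\ 0 & 0 \end{bmatrix}$$

giving for all $x$ and $z^T = (x^T, u^T)$,

$$J(x) = x^T M x = z^T M^J z, \qquad J(Fx + Gu) = z^T \Xi^T M^J \Xi z$$

This and (3.38) give, for any $z$,

$$z^T M^Q z = Q(x,u) = c(x,u) + J(Fx + Gu) = z^T \begin{bmatrix} S & 0 \\ 0 & R \end{bmatrix} z + z^T \Xi^T M^J \Xi z$$

The desired mapping from $M$ to $M^Q$ then follows, under the standing assumption that $M^Q$ is a
symmetric matrix:

$$M^Q = \begin{bmatrix} S & 0 \\ 0 & R \end{bmatrix} + \Xi^T M^J \Xi = \begin{bmatrix} S & 0 \\ 0 & R \end{bmatrix} + \begin{bmatrix} F^T M F & F^T M G \\ G^T M F & G^T M G \end{bmatrix}$$

The constraint (3.37) is thus equivalent to

$$z^T M^J z = J(x) \le Q(x,u) = z^T M^Q z, \qquad \text{for all } z$$

This is equivalent to the constraint $M^J \le M^Q$, which is (3.42b). □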
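The matrix inequality (3.42b) is easy to test numerically. The following is a minimal sketch under stated assumptions: it uses the standard discrete-time Riccati recursion for the undiscounted LQR problem as a stand-in for (3.40), and the matrices F, G, S, R are hypothetical values chosen only for illustration. It approximates $M^*$ by iteration, forms the left-hand side of (3.42b), and inspects its eigenvalues.

```python
import numpy as np

def riccati_iteration(F, G, S, R, tol=1e-10, max_iter=100_000):
    """Approximate M* by iterating the discrete-time Riccati recursion from M = 0."""
    M = np.zeros_like(S, dtype=float)
    for _ in range(max_iter):
        K = np.linalg.solve(R + G.T @ M @ G, G.T @ M @ F)
        M_next = S + F.T @ M @ F - F.T @ M @ G @ K
        if np.max(np.abs(M_next - M)) < tol:
            return M_next
        M = M_next
    return M

def lmi_lhs(M, F, G, S, R):
    """Left-hand side of (3.42b): diag(S, R) + [F G]^T M [F G] - diag(M, 0)."""
    n = F.shape[0]
    FG = np.hstack([F, G])
    lhs = FG.T @ M @ FG
    lhs[:n, :n] += S - M
    lhs[n:, n:] += R
    return lhs

# Hypothetical data (an assumption, not taken from the text)
F = np.array([[1.0, 0.1], [0.0, 1.0]])
G = np.array([[0.0], [0.1]])
S = np.eye(2)
R = np.array([[1.0]])

M_star = riccati_iteration(F, G, S, R)
eigs = np.linalg.eigvalsh(lmi_lhs(M_star, F, G, S, R))
print("M* =", M_star)
print("eigenvalues of the matrix in (3.42b):", eigs)
```

At the fixed point of the recursion the Schur complement of the lower-right block vanishes, so the smallest eigenvalue is zero up to round-off: the constraint is active at the optimizer.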


3.7 A Second Glance Ahead 


In Section 2.5 it was only possible to talk about RL within the framework of policy selection within 
a parameterized family. We can broaden our discussion now that we know something about optimal 
control. 

Let’s turn to a common approach to RL in which we choose a parameterized family of functions
$\{Q^\theta : \theta \in \mathbb{R}^d\}$, and seek among them an approximation to the Q-function $Q^*$ defined in (3.7a). We
frequently use a linear parameterization:

$$Q^\theta(x,u) = \theta^T \psi(x,u), \qquad \theta \in \mathbb{R}^d \qquad (3.43)$$

in which $\psi_i : X \times U \to \mathbb{R}$ is called the $i$th basis function, $1 \le i \le d$. For any $\theta$ we obtain a policy by
mimicking (3.7c):

$$\phi^\theta(x) \in \arg\min_u Q^\theta(x,u), \qquad x \in X \qquad (3.44)$$

Consider how we might approximate PIA from Section 3.2.2: given an initial policy $\phi^0$, generate
a sequence of policies $\{\phi^n\}$ and parameter estimates $\{\theta_n\}$ as follows:

(i) Obtain a parameter $\theta_n$ to achieve the approximation $Q^{\theta_n} \approx Q_n$, where the latter is the
fixed-policy Q-function that satisfies

$$Q_n(x,u) = c(x,u) + Q_n(x^+, u^+), \qquad x^+ = F(x,u), \quad u^+ = \phi^n(x^+) \qquad (3.45)$$

(ii) Define a new policy $\phi^{n+1} = \phi^{\theta_n}$.


Variations of this approximation of PIA are explored in greater depth in Section 5.3, and in Part II. 


An alternative is to build an approximation algorithm based on the dynamic programming
equation for the Q-function introduced in (3.7d):

$$Q^*(x,u) = c(x,u) + \underline{Q}^*(F(x,u)), \qquad \underline{Q}^*(x) = \min_u Q^*(x,u)$$

which admits the model-free representation, for any state-input trajectory:

$$Q^*(x(k),u(k)) = c(x(k),u(k)) + \underline{Q}^*(x(k+1)) \qquad (3.46)$$

Just as in (2.52), for any approximation $\widehat{Q}$ we can observe the Bellman error:

$$\mathcal{D}_{k+1}(\widehat{Q}) := -\widehat{Q}(x(k),u(k)) + c(x(k),u(k)) + \underline{\widehat{Q}}(x(k+1)) \qquad (3.47)$$

This is zero for every $k$ if $\widehat{Q} = Q^*$.

Q-learning is broadly defined as algorithms to choose $\theta^*$ so that $|\mathcal{D}_{k+1}(Q^\theta)|$ is in some sense
minimized over all $\theta$, based on observations of the system for $k = 0$ to $N$. The first approach that
might come to mind is to mimic the mean-square criterion (2.53):

$$\Gamma(\theta) = \frac{1}{N} \sum_{k=0}^{N-1} \bigl[ \mathcal{D}_{k+1}(Q^\theta) \bigr]^2 \qquad (3.48)$$


The LP approach of Section 3.5 suggests alternatives, and other approaches are investigated in 
Chapter 5. 
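The objects in this section can be made concrete on a toy problem. The sketch below is illustrative only: the six-state model, the cost, and the one-hot ("tabular") basis are all assumptions introduced here, chosen because the exact Q-function then lies in the linear family (3.43). It evaluates the greedy policy (3.44), the Bellman error (3.47) along an observed trajectory, and the mean-square criterion (3.48).

```python
import numpy as np

STATES = tuple(range(6))   # hypothetical state space {0, ..., 5}
INPUTS = (-1, 1)           # hypothetical input space

def F(x, u):
    """Move one step left or right, staying inside {0, ..., 5}."""
    return min(5, max(0, x + u))

def cost(x, u):
    """Unit cost away from the target state 0."""
    return 0.0 if x == 0 else 1.0

def psi(x, u):
    """One-hot basis: a special case of (3.43) with d = |X| |U|."""
    v = np.zeros(len(STATES) * len(INPUTS))
    v[x * len(INPUTS) + INPUTS.index(u)] = 1.0
    return v

def q(theta, x, u):
    return float(theta @ psi(x, u))

def policy(theta, x):
    """Greedy policy (3.44) for the parameter theta."""
    return min(INPUTS, key=lambda u: q(theta, x, u))

def bellman_errors(theta, xs, us):
    """Bellman error (3.47) along a trajectory xs[0..N], us[0..N-1]."""
    errs = []
    for k, u in enumerate(us):
        q_min_next = min(q(theta, xs[k + 1], v) for v in INPUTS)
        errs.append(-q(theta, xs[k], u) + cost(xs[k], u) + q_min_next)
    return np.array(errs)

def mean_square_error(theta, xs, us):
    """Mean-square criterion (3.48)."""
    return float(np.mean(bellman_errors(theta, xs, us) ** 2))

# Observed trajectory generated by an arbitrary (non-optimal) input sequence
us = [1, 1, -1, -1, -1, -1, -1, 1, -1]
xs = [3]
for u in us:
    xs.append(F(xs[-1], u))

# For this toy model J*(x) = x, so that Q*(x, u) = c(x, u) + F(x, u)
theta_star = sum((cost(x, u) + F(x, u)) * psi(x, u) for x in STATES for u in INPUTS)
theta_zero = np.zeros_like(theta_star)

print(mean_square_error(theta_star, xs, us), mean_square_error(theta_zero, xs, us))
```

The Bellman error vanishes at `theta_star` for any input sequence, which is the model-free property expressed by (3.46): the trajectory need not be generated by a good policy.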


3.8 Optimal Control in Continuous Time* 


Hopefully it was made clear in Section 2.3.4 that calculus can bring clarity to the theory of state space
models. For example, the discrete-time counterpart of Fig. 2.2 is far less enlightening. Further
motivation was given in prior examples and in Section 2.6: in many cases the system model is
based on laws of physics. The value of calculus is even greater when we come to optimal control,
since it is frequently simpler to pose and approximate optimal control problems in continuous time.

The nonlinear state space model was previously introduced in (2.19), and repeated here for
convenience: $\tfrac{d}{dt} x = f(x,u)$. The total cost $J$ associated with a particular control input $u = u_{[0,\infty)}$
is defined by the integral

$$J(u) = \int_0^\infty c(x_t, u_t)\, dt$$

where $c$ is a scalar-valued function of $(x,u)$ as before, and as before our goal is to minimize $J$ over
all inputs. This will require assumptions if we expect $J$ to be finite. A minimal assumption is that
there is a target state $x^e$ that is an equilibrium for some input $u^e$, and the cost vanishes at this
equilibrium:

$$f(x^e, u^e) = 0, \qquad c(x^e, u^e) = 0$$

The value function is defined as in (3.2):

$$J^*(x) = \min_{u_{[0,\infty)}} \int_0^\infty c(x_t, u_t)\, dt, \qquad x_0 = x \in X.$$

Under general conditions, it satisfies a differential equation known as the Hamilton-Jacobi-Bellman
(HJB) equation. Its derivation proceeds as in the construction of the Bellman Equation in Section 3.2.
Let $x$ be an arbitrary initial state, and let $t_m$ be an intermediate time, $0 < t_m < \infty$. As in the
discrete-time development, we regard $J^*(x_{t_m})$ as the cost to go. Based on this interpretation we
obtain,

$$\begin{aligned}
J^*(x) &= \min_{u_{[0,\infty)}} \Bigl[ \int_0^{t_m} c(x_t,u_t)\, dt + \int_{t_m}^\infty c(x_t,u_t)\, dt \Bigr] \\
&= \min_{u_{[0,t_m]}} \Bigl[ \int_0^{t_m} c(x_t,u_t)\, dt + \underbrace{\min_{u_{[t_m,\infty)}} \Bigl( \int_{t_m}^\infty c(x_t,u_t)\, dt \Bigr)}_{J^*(x_{t_m})} \Bigr]
\end{aligned}$$

That is,

$$J^*(x) = \min_{u_{[0,t_m]}} \Bigl[ \int_0^{t_m} c(x_t,u_t)\, dt + J^*(x_{t_m}) \Bigr], \qquad x_0 = x \qquad (3.49)$$

This identity is the continuous-time extension of the Bellman equation, as illustrated in Fig. 3.1.
It is known as the principle of optimality: if the optimal trajectory passes through the state $x_m$ at
time $t_m$ using the control $u^*$, then the control $u^*_{[t_m,\infty)}$ must be optimal for the system starting at
$x_m$ at time $t_m$. As remarked in the caption of Fig. 3.1: if a better $u^*$ existed on $[t_m,\infty)$, we would
have chosen it.


The HJB equation is obtained on letting $t_m \downarrow 0$. Define $\Delta_x$ via,

$$\Delta_x = x_{t_m} - x_0 = x_{t_m} - x$$

Assuming that the value function is continuously differentiable, we may perform a Taylor series
expansion using the optimality equation (3.49) to obtain

$$J^*(x) = \min_{u_{[0,t_m]}} \bigl\{ c(x,u_0)\, t_m + J^*(x) + \nabla J^*(x) \cdot \Delta_x \bigr\} + o(t_m)$$

The “little oh” notation $o(t_m)$ is recalled in Appendix A. Subtracting $J^*(x)$ from each side and
dividing through by $t_m$ then gives

$$0 = \min_{u_{[0,t_m]}} \Bigl\{ c(x,u_0) + \nabla J^*(x) \cdot \frac{\Delta_x}{t_m} \Bigr\} + o(1)$$

Letting $t_m \downarrow 0$, we obtain $o(1) \to 0$ by definition, and the ratio $\Delta_x / t_m$ can be replaced by a
derivative:

$$\lim_{t_m \downarrow 0} \frac{\Delta_x}{t_m} = \frac{d}{dt} x_t \Big|_{t=0} = f(x, u_0)$$

This gives the celebrated equation:

$$0 = \min_u \bigl[ c(x,u) + \nabla J^*(x) \cdot f(x,u) \bigr] \qquad (3.50)$$


Theorem 3.12. If the value function $J^*$ has continuous derivatives, then it satisfies the HJB
equation (3.50). Suppose that the minimum in (3.50) exists and is unique for each $x$, to form a
continuous function $\phi^*$. Then, the optimal control is expressed as state feedback:

$$u^*_t = \phi^*(x^*_t) = \arg\min_u \bigl[ c(x^*_t,u) + \nabla J^*(x^*_t) \cdot f(x^*_t,u) \bigr]$$

The term in brackets in (3.50) has an important interpretation in terms of the Hamiltonian,

$$H(x,p,u) = c(x,u) + p^T f(x,u). \qquad (3.51)$$

The function of two variables $Q(x,u) := H(x, \nabla_x J^*(x), u)$ is the Q-function of reinforcement learning
for models in continuous time.

The following result is a version of the minimum principle of optimal control.

Theorem 3.13. Suppose that an optimal input-state pair exists, and that the value function $J^*$
has continuous derivatives. Then the optimal control $u^*_t$ must minimize the Hamiltonian for each
time $t$,

$$\min_u H(x^*_t, p_t, u) = H(x^*_t, p_t, u^*_t), \qquad (3.52)$$

with $p_t = \nabla_x J^*(x^*_t)$. □

The standard minimum principle does not start with the HJB equation. Rather, the vector-valued
“co-state” process $\{p_t\}$ is defined by another differential equation. This can provide enormous
reduction in complexity since we do not need to compute the value function.


3.9 Examples 


3.9.1 Mountain Car 


This is an optimal control problem with infinite state space. The steps used to obtain the value
function shown in Fig. 3.2 are summarized here, beginning with an approximate model:

Quantized State Space. A continuous state space model is commonly approximated through
a partition of the state space into a finite collection of bins $X = \cup_{i=1}^n B_i$. This step is called
quantization or binning. The next step is to choose a representative state $x^i \in B_i$ for each $i$, and
define approximate dynamics on the state space consisting of the $n$ states $X_\diamond = \{x^i : 1 \le i \le n\}$.

In this two dimensional example it is convenient to choose bins based on a rectangular grid: for
an integer $N > 1$, choose $n = N^2$, and select $X_\diamond = \{x^{i,j} = (z^i, v^j)^T : 1 \le i \le N,\ 1 \le j \le N\}$, where
the position and velocity values are equally spaced and satisfy

$$-1.2 = z^1 < z^2 < \cdots < z^N = 0.5, \qquad -\bar{v} = v^1 < v^2 < \cdots < v^N = \bar{v}$$

Each state $x^{i,j}$ belongs to a bin denoted $B_{i,j}$: the bins are disjoint, with union equal to $X$. The
value $N = 160$ was used to obtain the approximation shown in Fig. 3.2.
The state space model with state space $X_\diamond$ is denoted

$$x_\diamond(k+1) = F_\diamond(x_\diamond(k), u(k)) \qquad (3.53)$$

where $x_\diamond(k) \in X_\diamond$ and $u(k) \in U = \{-1, 1\}$ for each $k$. The next step is to define the dynamics (i.e.,
$F_\diamond$) which requires some care.

We define $F_\diamond(x^{N,j}, u) = (z^N, 0)^T$ for any $u$ and $j$, since in this case $x^{N,j} = (0.5, v^j)^T$ (so the car is
parked).

Consider now $x^{i,j} \in X_\diamond$ with $i < N$ and $u \in U$. Denote $x^{i,j}(u) = F(x^{i,j}, u)$, the two dimensional
vector with components

$$z^{i,j}(u) = F_z(x^{i,j}, u), \qquad v^{i,j}(u) = F_v(x^{i,j}, u)$$

Define indices $i_+$ and $j_+$ uniquely by the constraint $x^{i,j}(u) \in B_{i_+,j_+}$. The specification of velocity
dynamics is straightforward:

$$F_{\diamond v}(x^{i,j}, u) = v^{j_+}$$

The position dynamics are modified slightly as follows:

$$F_{\diamond z}(x^{i,j}, u) = \begin{cases} z^{i+1}, & z^{i,j}(u) > z^i \text{ and } z^{i_+} = z^i \\ z^{i-1}, & z^{i,j}(u) < z^i \text{ and } z^{i_+} = z^i \\ z^{i_+}, & \text{else} \end{cases} \qquad (3.54)$$

These dynamics are defined so that $F_{\diamond z}(x^{i,j}, u) \ne z^i$ for any $i < N$. Rationale: to avoid the
existence of a state $x_\diamond = x^{i,j}$ satisfying $i < N$ and $x_\diamond = F_\diamond(x_\diamond, u)$ for any $u \in U$.
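The position rule (3.54) can be sketched in a few lines. This is an illustration under assumptions that are not in the text: bins are taken to be the sets of points closest to each grid value, positions outside the grid are clipped to its endpoints, indices are 0-based, and the coarse grid `z_grid` is hypothetical.

```python
import numpy as np

def nearest_index(grid, value):
    """Index of the grid point closest to value (bins centered on grid points)."""
    return int(np.argmin(np.abs(np.asarray(grid) - value)))

def quantized_position_index(i, z_next, z_grid):
    """Position rule (3.54) with 0-based indices.

    i      : index of the current position z_grid[i], assumed i < N - 1
    z_next : next position produced by the continuous-state model F_z
    """
    z_next = min(max(z_next, z_grid[0]), z_grid[-1])   # keep inside the grid
    i_plus = nearest_index(z_grid, z_next)
    if i_plus == i and z_next > z_grid[i]:
        return i + 1      # small move to the right: force a one-bin step
    if i_plus == i and z_next < z_grid[i]:
        return i - 1      # small move to the left: force a one-bin step
    return i_plus

# Hypothetical coarse grid with spacing 0.1 (the text uses N = 160)
z_grid = np.linspace(-1.2, 0.5, 18)
print(quantized_position_index(5, z_grid[5] + 0.01, z_grid))
```

Without the two special cases, a small displacement would map the state back to its own bin, and the quantized model could stall at a state that the continuous model leaves.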


VIA and PIA implementation. VIA will be successful for any initialization $V^0$ in this finite
state space model. The choice $V^0 = 0$ was used in the numerical experiments described in the
following.

Successful implementation of PIA requires an initial policy that is stabilizing, in the sense
that the goal is reached from each initial condition. A variant of (2.59) was chosen: for each
$x = (z,v)^T \in X_\diamond$,

$$\phi^0(x) = \begin{cases} \operatorname{sign}(v), & z + v > \text{Tol} \\ 1, & \text{else} \end{cases}$$

The value Tol $= -0.8$ was found to work well.

Computation of the fixed-policy value function $J^n$ for the policy $\phi^n$ was performed using the numerical
technique surveyed in Section 3.2.3. The fact that the computation of $J^0$ was successful implies that the
policy $\phi^0$ is stabilizing. Prop. 3.4 then implies that each of the policies $\{\phi^n\}$ obtained using PIA is also
stabilizing.

Recall the Bellman error defined in (3.31). The maximal absolute error at iteration $n$ is denoted

$$\bar{\mathcal{B}}_n = \max_x \Bigl| -J^n(x) + \min_u \bigl[ c(x,u) + J^n(F(x,u)) \bigr] \Bigr|$$

The same notation is used for VIA, with $J^n$ replaced by $V^n$. Plots of the maximal Bellman error as a function of
iteration $n$ are shown for each algorithm in Fig. 3.3. While PIA requires far fewer iterations to
achieve $\bar{\mathcal{B}}_n = 0$, the complexity per iteration for PIA is far greater than VIA: recall that in the
former, the fixed point equation (3.12) must be solved to obtain $J^n$ in each iteration.

Figure 3.3: Convergence of the two basic dynamic programming algorithms for Mountain Car (vertical axis: maximum Bellman error $\bar{\mathcal{B}}_n$).
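The maximal Bellman error is simple to track during value iteration. The sketch below is not the Mountain Car experiment: it runs VIA on a hypothetical six-state model introduced only for illustration, and records the maximal Bellman error of each iterate $V^n$.

```python
def value_iteration(states, inputs, F, cost, V0, n_iter):
    """VIA on a finite model, recording the maximal Bellman error of each iterate."""
    def bellman_update(V):
        return {x: min(cost(x, u) + V[F(x, u)] for u in inputs) for x in states}

    V, errors = dict(V0), []
    for _ in range(n_iter):
        V = bellman_update(V)        # next iterate
        TV = bellman_update(V)       # right-hand side of the Bellman equation at V
        errors.append(max(abs(TV[x] - V[x]) for x in states))
    return V, errors

# Hypothetical model: walk on {0, ..., 5} with unit cost away from the target 0
STATES = range(6)
INPUTS = (-1, 1)

def F(x, u):
    return min(5, max(0, x + u))

def cost(x, u):
    return 0.0 if x == 0 else 1.0

V, errors = value_iteration(STATES, INPUTS, F, cost, {x: 0.0 for x in STATES}, n_iter=8)
print(errors)
print(V)
```

In this toy model the error stays at its initial level until the iterates have propagated the cost-to-go from the target to the farthest state, and is exactly zero afterwards.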


3.9.2 Spiders and flies 


In the example illustrated in Fig. 3.4 there are 15 “agents” (the spiders) that are cooperating to capture
a single fly. In any “realistic” version of this example, the fly will hop around the grid in an unpredictable
fashion. This realism requires a stochastic model that is beyond our reach at this stage in the book. For now
let’s assume that the fly is stationary, and the spiders move to their desired position without error.

At each time $k$, each spider can move to the next square (horizontally or vertically), or stay in its current
position. There are 5 possible moves per spider: hence with 15 spiders, the input space $U$ is massive: $|U| = 5^{15}$
(over $10^{10}$ elements). The minimization step (3.8) in the value iteration algorithm is infeasible.

Figure 3.4: A multi-agent control problem: spiders cooperate to capture a single fly.

The state space is also large: if we define the state as the location of squares inhabited by spiders, then
$|X| = \binom{100}{15}$ (over $10^{17}$ elements). Let’s not worry about this complexity right now. We will develop
approximation techniques to effectively compress a complex state space.

The size of the input space can be reduced dramatically through a modification of the definition
of state, and a minor modification of dynamics. The system as described evolves in discrete time,
with an increment from $k$ to $k+1$ representing $T_s$ seconds in “real time”. In the new system
description we divide this sampling time by 15, and enforce that only one spider changes position
at each time-step on this new time scale. For this we assume that there is a unique number $1, \dots, 15$
associated with each spider, and they move in this given order.

The new state space is represented as a product space $X = Z \times \{1, \dots, 15\}$. If $x(k) = (z,m)$,
then $z \in Z$ specifies the position of each of the 15 labeled spiders on the grid at time $k$, and $m$
is the label for the spider who will make a decision at this time. At time $k+1$, the new state
$x(k+1) = (z', m')$ is specified: $z'$ denotes the new positions of spiders after spider-$m$ is moved
based on the decision at time $k$, and $m' = m + 1$, with the convention that “16” represents “1”.
That is, $m' - 1 = m$ (modulo 15).

The size of the state space is increased because we are now considering ordered subsets of the
grid, and also because we are keeping track of the label. The number of ordered subsets of size 15
is $|Z| = 15! \times \binom{100}{15}$, giving

$$|X| = |Z| \times 15 \approx 5 \times 10^{30}$$

This is massive, but remains finite. The input space $U$ consists of only 5 elements.
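The counts quoted in this example are easy to reproduce. The short sketch below assumes a grid of 100 squares (`n_squares` is an assumption, chosen because it matches the orders of magnitude stated above); the variable names are illustrative.

```python
from math import comb, factorial

n_spiders, n_moves = 15, 5
n_squares = 100   # assumption: number of squares in the grid

joint_inputs = n_moves ** n_spiders                 # |U| when all spiders move at once
unordered_states = comb(n_squares, n_spiders)       # sets of squares occupied by spiders
ordered_states = factorial(n_spiders) * unordered_states   # |Z|: labeled spiders
sequential_states = ordered_states * n_spiders      # |X| = |Z| x 15

print(f"{joint_inputs:.2e}  {unordered_states:.2e}  {sequential_states:.2e}")
```

The trade is stark: the state space grows by a factor of $15 \times 15!$, while the number of inputs to compare at each step drops from $5^{15}$ to 5.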

Hence if we have an approximation $\widehat{Q}$ of the Q-function that defines the optimal policy via
(3.7c), then on observing the state $x = x(k)$ at time $k$, the input $u(k)$ is obtained by minimizing
over only five elements:

$$\phi(x) \in \arg\min_{u \in U} \widehat{Q}(x,u)$$


3.9.3 Contention for resources and instability 


Shown in Fig. 3.5 is an example of a queueing network with multiple arrivals. The model is inspired 
by applications to semiconductor manufacturing plants in which many different types of components 
are created, but many share similar needs in terms of raw materials and processing steps. The figure 
shows a caricature of this application in which there are two final products emerging from buffers 2 
and 4. Raw material arrives to buffers 1 and 3, so that each of the two products requires attention 
at each of the two stations. 


Figure 3.5: A multiclass queueing network: at each of the two stations, the scheduling problem amounts to determining
processing rates for each of the two materials waiting in queue.


Queueing networks are subject to significant volatility and uncertainty: processors break down, 
arrivals are not predictable, and there may be spikes in demand. For these reasons, it may seem 
silly to consider a deterministic model for an application that so obviously fits in later chapters. 

This example will serve to show that significant insight can be obtained by first considering an 
ideal model without disturbances. There is also some theoretical justification for considering total- 
cost optimal control as an approximation for average-cost optimal control for stochastic control 
systems—more on this topic can be found in Section 7.2 and in [254]. 

It is simplest to begin in continuous time. 


Fluid model. In this simplest model, the buffer levels take on non-negative but continuous values.
Raw material flows into buffers 1 and 3 continuously at rates $\alpha_1$ and $\alpha_3$, respectively. Each of the
two stations has a single server. For example, at station 1 the server may work on buffer 1 or
buffer 4. During a time interval in which server 1 devotes its capacity to buffer 1, material
flows continuously from buffer 1 to buffer 2 at rate $\mu_1$. The evolution of the queue lengths can be
described by the linear state space model, with state $q$ evolving in $X = \mathbb{R}^4_+$:

$$\tfrac{d^+}{dt} q = Fq + Gu + \alpha$$

The right derivative $\tfrac{d^+}{dt}$ is used since state trajectories are typically piecewise smooth. The system
parameters are $\alpha = (\alpha_1, 0, \alpha_3, 0)^T$,

$$F = 0_{4 \times 4}, \qquad G = \begin{bmatrix} -\mu_1 & 0 & 0 & 0 \\ \mu_1 & -\mu_2 & 0 & 0 \\ 0 & 0 & -\mu_3 & 0 \\ 0 & 0 & \mu_3 & -\mu_4 \end{bmatrix}$$

The four dimensional input is subject to several constraints:
• $u_t(i) \ge 0$ for each $i$ (it isn’t possible to “un-do” processing at a buffer)
• $u_t(1) + u_t(4) \le 1$ and $u_t(2) + u_t(3) \le 1$ (station constraints)
• $\tfrac{d^+}{dt} q_t(i) \ge 0$ whenever $q_t(i) = 0$ (enforces non-negativity of the four buffers).

Collectively these constraints define the region $U(x)$ for each $x \in X = \mathbb{R}^4_+$.


Stabilizability and load. Is the origin an equilibrium for some input? This is an easy question
to answer because the dynamics are linear. If $q_t = q^e = 0$ for some input $u^e_t$, and all $t$, then

$$0 = \tfrac{d^+}{dt} q^e_t = F q^e_t + G u^e_t + \alpha = G u^e_t + \alpha$$

This gives $u^e_t = u^e := -G^{-1}\alpha$. The matrix $-G^{-1}$ exists and has non-negative entries. To see if
$u^e \in U$ we must consider the station constraints, expressed as $C u^e \le 1$, where $C$ is known as the
constituency matrix:

$$C = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix}$$

The $2 \times 4$ dimensional matrix obtained as the product is known as the workload matrix:

$$\Xi := -C G^{-1} = \begin{bmatrix} 1/\mu_1 & 0 & 1/\mu_4 & 1/\mu_4 \\ 1/\mu_2 & 1/\mu_2 & 1/\mu_3 & 0 \end{bmatrix}$$

The two dimensional vector $\rho := C u^e = -C G^{-1} \alpha$ is obviously important, since $\rho_i \le 1$ for each
$i$ is equivalent to feasibility of $u^e$. The network load is defined to be the maximum, $\rho_\bullet = \max_i \rho_i$.
There is also a dynamic notion of workload, $w_t = \Xi q_t$, which evolves as

$$\tfrac{d^+}{dt} w_t = -C G^{-1} \{ G u_t + \alpha \} = -C u_t + \rho$$

Letting $\iota_t = 1 - C u_t$ denote the idleness rate at time $t$ gives

$$\tfrac{d^+}{dt} w_t = -(1 - \rho) + \iota_t$$

The idleness rate is non-negative: $\iota_t \in \mathbb{R}^2_+$ whenever $u_t \in U$. This explains why a strict bound is
imposed:

$$\rho_\bullet := \max_i \rho_i < 1 \qquad (3.55)$$

Under this load condition, there are many stabilizing policies: $u_t = \phi(q_t)$ designed so that the
origin is reached in finite time, from any initial condition.

One stabilizing policy is defined by a control-Lyapunov function design:

$$u_t \in \arg\min \bigl\{ \tfrac{d^+}{dt} V(q_t) : u_t \in U(q_t) \bigr\}$$

where $V : X \to \mathbb{R}_+$ vanishes only at the origin. Consider for example $V(q) = \tfrac12 \|q\|^2$. The chain
rule gives

$$\tfrac{d^+}{dt} \tfrac12 \|q_t\|^2 = q_t^T \tfrac{d^+}{dt} q_t = q_t^T [G u_t + \alpha]$$

The control-Lyapunov function design is thus

$$\begin{aligned}
\phi(x) &= \arg\min_{u \in U(x)} x^T G u = \arg\max_{u \in U(x)} x^T(-Gu) \\
&= \arg\max_{u \in U(x)} \bigl( u_1 \mu_1 (x_1 - x_2) + u_2 \mu_2 x_2 + u_3 \mu_3 (x_3 - x_4) + u_4 \mu_4 x_4 \bigr)
\end{aligned}$$

This results in a policy $u_t = \phi(q_t)$, with state dependent priorities. For example:

$$u_t(1) = 1 \ \text{ if } \ \mu_1 (q_t(1) - q_t(2)) > \mu_4 q_t(4)$$
$$u_t(3) = 1 \ \text{ if } \ \mu_3 (q_t(3) - q_t(4)) > \mu_2 q_t(2)$$

This is known as the MaxWeight policy in the queueing network literature. It is not difficult to
show that the policy is stabilizing: $V$ serves as a Lyapunov function, provided $\rho_\bullet < 1$.
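The priority rules above translate directly into code. The following is a minimal sketch with simplifying assumptions: each station either serves one buffer at full rate or idles, ties are resolved in favor of the exit buffer, and the boundary constraints at empty buffers are handled only by never serving a buffer whose weight is not positive. The function name `maxweight` is illustrative.

```python
def maxweight(q, mu):
    """MaxWeight allocation for the four-buffer network.

    q, mu : sequences of length 4; index 0 holds buffer 1, index 3 holds buffer 4.
    Returns a list u with entries in {0, 1}. Assumes q is non-negative.
    """
    u = [0, 0, 0, 0]

    # Station 1 chooses between buffer 1 and the exit buffer 4
    w1, w4 = mu[0] * (q[0] - q[1]), mu[3] * q[3]
    if w1 > w4:          # w4 >= 0, so this also forces w1 > 0
        u[0] = 1
    elif w4 > 0:
        u[3] = 1

    # Station 2 chooses between buffer 3 and the exit buffer 2
    w3, w2 = mu[2] * (q[2] - q[3]), mu[1] * q[1]
    if w3 > w2:
        u[2] = 1
    elif w2 > 0:
        u[1] = 1
    return u

print(maxweight([1, 0, 4, 1], [10, 3, 10, 3]))
```

Each station idles when both of its weights are non-positive, which is consistent with the non-negativity of the idleness rate.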


A well-motivated, but unstable policy. A common cost function is the total customer population
$c(x,u) = c(x) = \sum_i x_i$. This motivates the last buffer first served (LBFS) policy, in which
strict priority is given to exit buffers:

$$u_t(4) = 1 \text{ whenever } q_t(4) > 0, \qquad u_t(2) = 1 \text{ whenever } q_t(2) > 0.$$

The preference to exit buffers is motivated by the desire to make $c(q_t)$ decrease quickly.
Depending upon the system parameters, this policy may be destabilizing for the fluid model,
even when load condition (3.55) is met. An example is found when the service rates satisfy

$$\mu_1 > \mu_2 \quad \text{and} \quad \mu_3 > \mu_4. \qquad (3.56)$$

From the initial condition $q_0 = x = (1,0,0,0)^T$ the state trajectory under the LBFS policy can be
computed for small $t$:

$$q_t = x + t\,(\alpha_1 - \mu_1,\ \mu_1 - \mu_2,\ \alpha_3,\ 0)^T.$$

At time $T_1 = (\mu_1 - \alpha_1)^{-1}$ the first buffer empties, and we then have, for $t > T_1$, $t \sim T_1$,

$$q_t = q_{T_1} + (t - T_1)\,(0,\ \alpha_1 - \mu_2,\ \alpha_3,\ 0)^T.$$

At time $T_2 = T_1 + q_{T_1}(2)/(\mu_2 - \alpha_1)$ the second buffer will drain, and all of the work will be at buffer
three. Note that over the time interval $[T_1, T_2]$, buffer 1 remains empty: the arrivals to buffer 1 at
rate $\alpha_1$ are passed directly to buffer 2.

The main point is, during the entire time interval $[0, T_2]$ the exit buffer at Station 1 is starved.
Starting from time $T_2$ an analogous situation arises, where now the exit buffer at Station 2 is
temporarily starved. We can conclude that either $u_t(4) = 0$ or $u_t(2) = 0$ for all $t > 0$. This implies
that

$$u_t(2) + u_t(4) \le 1, \qquad t \ge 0 \qquad (3.57)$$

so that buffers 2 and 4 behave as if they are located at a single station. The inequality (3.57)
resulting from this policy is known as the virtual station constraint. The virtual load and virtual
workload process are defined by,

$$\rho_v = \frac{\alpha_1}{\mu_2} + \frac{\alpha_3}{\mu_4} \qquad (3.58)$$

$$w^v_t = \frac{q_t(1) + q_t(2)}{\mu_2} + \frac{q_t(3) + q_t(4)}{\mu_4}, \qquad t \ge 0 \qquad (3.59)$$

We can compute,

$$\tfrac{d^+}{dt} w^v_t = \frac{\alpha_1 - \mu_2 u_t(2)}{\mu_2} + \frac{\alpha_3 - \mu_4 u_t(4)}{\mu_4} = \rho_v - [u_t(2) + u_t(4)]$$

If $\rho_v > 1$ then $\tfrac{d^+}{dt} w^v_t \ge -(1 - \rho_v) > 0$ for all $t$, so that $\|q_t\| \to \infty$ as $t \to \infty$.

Figure 3.6: Sample paths of the multi-class network. Each plot shows a simulation of the four buffer levels in the
network using the LBFS policy, for identical initial conditions. Shown at left are results for a stochastic model
(panel title: CRW Model), and on the right results for a fluid model (panel title: Fluid Model).


A specific example is given by

$$\mu_1 = \mu_3 = 10; \qquad \mu_2 = \mu_4 = 3; \qquad \alpha_1 = \alpha_3 = 2$$

The network load is given by $\rho_\bullet = 2(1/3 + 1/10) < 1$, which implies that there are many stabilizing
policies. However, $\rho_v = 2(1/3 + 1/3) > 1$, so that the fluid model controlled using LBFS is unstable
in the strongest possible sense.
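These load calculations can be checked directly from the model matrices. The sketch below builds $G$ and the constituency matrix $C$ for the network with the parameter values of this example, and evaluates the workload matrix, the load vector, and the virtual load; the variable names are illustrative.

```python
import numpy as np

mu = np.array([10.0, 3.0, 10.0, 3.0])     # service rates mu_1, ..., mu_4
alpha = np.array([2.0, 0.0, 2.0, 0.0])    # arrivals to buffers 1 and 3

G = np.array([
    [-mu[0], 0.0, 0.0, 0.0],
    [mu[0], -mu[1], 0.0, 0.0],
    [0.0, 0.0, -mu[2], 0.0],
    [0.0, 0.0, mu[2], -mu[3]],
])
C = np.array([
    [1.0, 0.0, 0.0, 1.0],   # station 1 serves buffers 1 and 4
    [0.0, 1.0, 1.0, 0.0],   # station 2 serves buffers 2 and 3
])

Xi = -C @ np.linalg.inv(G)                     # workload matrix
rho = Xi @ alpha                               # load at each station
rho_v = alpha[0] / mu[1] + alpha[2] / mu[3]    # virtual load (3.58)

print("workload matrix:\n", Xi)
print("station loads:", rho, " network load:", rho.max())
print("virtual load:", rho_v)
```

Both stations carry the same load in this symmetric example, and the virtual load exceeds one even though the network load does not.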

Fig. 3.6 shows two simulations of the network model based on these parameters with common
initial condition. The evolution of buffer levels for the fluid model is shown on the right, where it
is evident that $\|q_t\| \to \infty$ as $t \to \infty$. The left hand side shows the behavior of a stochastic model
in which average arrival and service rates match those of the fluid model; details can be found in
Section 7.6. Solidarity of the two models is clear in this simulation.


3.9.4 Solving the HJB equation 


These final examples illustrate optimal control solutions for models in continuous time.

Example 3.9.1. LQR

The formulation of the linear quadratic regulator problem is as expected:

$$\begin{aligned}
\tfrac{d}{dt} x &= Fx + Gu, \qquad x(0) = x_0 \\
c(x,u) &= x^T S x + u^T R u
\end{aligned} \qquad (3.60)$$

where $S$ is a positive semidefinite matrix, and $R > 0$. Provided the value function $J^*$ is finite-valued,
it follows that it is a quadratic function of $x$, and an optimal policy is obtained via linear
state feedback. The proof of these statements is exactly the same as in Section 3.6.

In particular, once we know that $J^*(x) = x^T M^* x$ for each $x$, then the HJB equation gives

$$\begin{aligned}
\phi^*(x) &= \arg\min_u \bigl\{ x^T S x + u^T R u + [2M^* x]^T [Fx + Gu] \bigr\} \\
&= \arg\min_u \bigl\{ u^T R u + 2 x^T M^* G u \bigr\}
\end{aligned}$$

where in the second equation we imposed symmetry of $M^*$, giving $[M^* x]^T = x^T M^*$. The minimum
of this quadratic function of $u$ is obtained by solving the linear equation:

$$\nabla_u \bigl\{ u^T R u + 2 x^T M^* G u \bigr\} = 0$$

which gives $\phi^*(x) = -R^{-1} G^T M^* x$.

The closed loop dynamics are given by

$$\tfrac{d}{dt} x^* = [F - G R^{-1} G^T M^*] x^*$$

The HJB equation (3.50) gives

$$\tfrac{d}{dt} J^*(x^*_t) = -c_{\phi^*}(x^*_t), \qquad c_{\phi^*}(x) = c(x, \phi^*(x)) = x^T \{ S + M^* G R^{-1} G^T M^* \} x, \quad x \in X$$

A fixed point equation for $M^*$ is also obtained from (3.50):

$$\begin{aligned}
0 &= \bigl\{ x^T S x + u^T R u + [2M^* x]^T [Fx + Gu] \bigr\} \Big|_{u = \phi^*(x)} \\
&= x^T \{ S + M^* G R^{-1} G^T M^* \} x + x^T \{ 2 M^* F - 2 M^* G R^{-1} G^T M^* \} x
\end{aligned}$$

Substitute $2 x^T M^* F x = x^T [M^* F + F^T M^*] x$, and after cancelling terms this results in

$$0 = x^T \{ S + M^* F + F^T M^* - M^* G R^{-1} G^T M^* \} x$$

Since this holds for any $x$, and since the matrix within the brackets is symmetric, it follows that
$M^*$ is a positive semidefinite solution to the algebraic Riccati equation (ARE):

$$0 = S + F^T M^* + M^* F - M^* G R^{-1} G^T M^* \qquad (3.61)$$
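A numerical solution of (3.61) is readily obtained. The sketch below is one possible approach, not the method used in the text: it applies the classical construction based on the stable invariant subspace of the Hamiltonian matrix, which assumes that this matrix has no eigenvalues on the imaginary axis. The double integrator data is a hypothetical example chosen because the solution is known in closed form.

```python
import numpy as np

def solve_care(F, G, S, R):
    """Solve 0 = S + F'M + MF - M G R^{-1} G' M via the stable subspace of the Hamiltonian."""
    n = F.shape[0]
    Rinv_Gt = np.linalg.solve(R, G.T)
    H = np.block([[F, -G @ Rinv_Gt], [-S, -F.T]])
    eigvals, eigvecs = np.linalg.eig(H)
    stable = eigvecs[:, eigvals.real < 0]      # n columns spanning the stable subspace
    U1, U2 = stable[:n, :], stable[n:, :]
    M = np.real(U2 @ np.linalg.inv(U1))
    return 0.5 * (M + M.T)                     # remove round-off asymmetry

# Hypothetical example: double integrator with S = I and R = 1
F = np.array([[0.0, 1.0], [0.0, 0.0]])
G = np.array([[0.0], [1.0]])
S = np.eye(2)
R = np.array([[1.0]])

M = solve_care(F, G, S, R)
K = np.linalg.solve(R, G.T @ M)                # optimal policy is u = -K x
print("M* =", M)
print("K* =", K)
print("closed loop eigenvalues:", np.linalg.eigvals(F - G @ K))
```

For this example a hand calculation with (3.61) gives off-diagonal entries equal to 1 and diagonal entries equal to $\sqrt{3}$, which provides a check on the code.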
Example 3.9.2. Linear system with polynomial cost

Consider a scalar integrator model, with polynomial cost:

$$\tfrac{d}{dt} x = f(x,u) = u, \qquad c(x,u) = u^2 + x^4 \qquad (3.62)$$

The HJB equation becomes $0 = \min_u \{ u \tfrac{d}{dx} J^*(x) + u^2 + x^4 \}$. Minimizing with respect to $u$ defines
the state feedback solution,

$$\phi^*(x) = -\tfrac12 \tfrac{d}{dx} J^*(x).$$

The closed loop system has the appealing form

$$\tfrac{d}{dt} x^*_t = -\tfrac12 \tfrac{d}{dx} J^*(x^*_t)$$

Substituting the formula for $u^*$ back into the HJB equation gives the differential equation

$$0 = \bigl\{ u \tfrac{d}{dx} J^*(x) + u^2 + x^4 \bigr\} \Big|_{u = -\frac12 \frac{d}{dx} J^*(x)} = -\tfrac14 \bigl( \tfrac{d}{dx} J^*(x) \bigr)^2 + x^4$$

giving $\tfrac{d}{dx} J^*(x) = \pm 2x^2$. The unique non-negative and continuously differentiable solution is
$J^*(x) = \tfrac23 |x|^3$, and hence the optimal policy is

$$\phi^*(x) = -\tfrac12 \tfrac{d}{dx} J^*(x) = -x^2 \operatorname{sign}(x)$$
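The solution of this example can be verified numerically. The sketch below checks that the HJB residual vanishes under the claimed policy, and compares $J^*(x_0)$ with the cost accumulated along a closed-loop trajectory; the Euler step size and time horizon are arbitrary choices made for illustration.

```python
def J_star(x):
    return (2.0 / 3.0) * abs(x) ** 3

def dJ_star(x):
    return 2.0 * x * abs(x)           # derivative of (2/3)|x|^3

def phi_star(x):
    return -0.5 * dJ_star(x)          # equals -x^2 sign(x)

def hjb_residual(x):
    """Bracketed term in the HJB equation, evaluated at u = phi_star(x)."""
    u = phi_star(x)
    return u * dJ_star(x) + u ** 2 + x ** 4

def simulated_cost(x0, dt=1e-3, horizon=50.0):
    """Euler approximation of the total cost under the policy phi_star."""
    x, total = x0, 0.0
    for _ in range(int(horizon / dt)):
        u = phi_star(x)
        total += (u ** 2 + x ** 4) * dt
        x += u * dt
    return total

print(max(abs(hjb_residual(x)) for x in (-2.0, -0.5, 0.0, 0.3, 1.7)))
print(J_star(1.0), simulated_cost(1.0))
```

The closed-loop state decays only polynomially in time, so the horizon must be long enough for the remaining cost $J^*(x_t)$ to be negligible.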


3.10 Exercises 


3.1 Solving the ARE Consider the two dimensional linear state space model discussed below 
(2.47), now with input: 


a(z) = Fa(k)+Gu(k), F=1I+0.02 es a , G= 0.02 Hl 


Solve the ARE with c(z,u) = 27 + u? to obtain the value function and the optimal policy. Is J* 
coercive? 


3.2 Shown below is an example of a routing model as considered in computer networking courses: 


This is a network with 7 nodes, and directed weighted edges as shown. Associated with any path
there is a cost. For example, the path 7 → 6 → 3 → 1 has cost 2 + 9 + 5 = 16. The goal is to
obtain the path to node 1 with minimal cost, for each node.

Formulate this shortest path problem as an infinite horizon optimal control problem by specifying
the state space $X$, the input space $U$, the model $F(x,u)$ and cost $c(x,u)$. The “self loop” shown at
node 1 may help you to conceptualize your optimal control formulation. Describe VIA and PIA
for this example, and solve the SPP using each algorithm.

Plot the error of your estimates as a function of iteration $n$ for various initial conditions. It is
interesting to see what happens when $V^0(i) < 0$ for each $i \ne 1$ in VIA (but remember to use
$V^0(1) = 0$, which is motivated by the representation (3.9)).

Standard solution methods are the Bellman-Ford algorithm and Dijkstra’s algorithm. The former
is regarded as a special case of VIA.


3.3 Box constrained LQR Consider the optimal control problem in Exercise 3.1 in which the
input is subject to the box constraint: $u(k) \in U = \{u \in \mathbb{R} : -1 \le u \le 1\}$. LQR theory will not give
you the optimal policy unless the initial condition $x(0)$ is sufficiently close to the origin. Numerical
methods are the only answer [369].

Use VIA to approximate the optimal policy on the bounded domain $\{x \in \mathbb{R}^2 : -10 \le x_i \le 10,\ i = 1,2\}$.
Compare your solution to the projected optimal policy:

$$\phi(x) = \max(-1, \min(1, K^* x)), \quad \text{with } K^* \text{ obtained in Exercise 3.1.}$$


3.4 Consider the scalar optimal control problem with linear dynamics:

$$x(k+1) = x(k) - u(k) \qquad (3.63)$$

with $x(k)$ and $u(k)$ constrained to $X = \mathbb{R}_+$.

(a) With cost function $c(x,u) = x^2 + Ru^2$, show that the LQR solution solves the total cost
optimization problem (3.2). You must check that your policy is feasible, which means that
$0 \le u^*(k) \le x^*(k)$ for each $k$. Obtain your solution by hand (no computer, since you must obtain the
solution as a function of $R$).

In the remainder of this exercise you will consider the more general cost $c(x,u) = x^p + Ru^q$. Assume
that $p, q \ge 1$ so that $c$ is a convex function on $\mathbb{R}^2_+$. Linearity of (3.63) then implies that $J^*$ is also
convex.

(b) Write code to approximate $J^*$ and $\phi^*$ using VIA and/or PIA on a finite set $X_\diamond = \{i\delta : 0 \le i \le n-1\}$,
with $n\delta = 10$ and $\delta \le 0.1$. Obtain plots of the value function and optimal policy for a
few choices of $(p, q, R)$. Include the special case $p = 1$ and $q = 2$.

(c) You might guess that $J^*$ is approximated by $J(x) = S x^s$, with $S > 0$ and $s > 1$. Estimate
the Bellman error (3.31) for arbitrary $R$, keeping in mind that $u$ is constrained:

$$\mathcal{B}(x) = -J(x) + \min_{0 \le u \le x} [c(x,u) + J(F(x,u))], \qquad F(x,u) = x - u$$

If you find yourself stuck you may instead resort to computation: plot $\mathcal{B}$ as a function of $x$ for
various values of $p, q, R, s, S$. For a few values of $p, q, R$, what is the best value of $s, S$?
You might also first solve the problem for a model in continuous time to gain intuition (see Exercise 3.8).
3.5 The similarity between (3.26) and (3.8) is explored in this exercise.

To get started, assume that $J^0 = 0$ for the initialization of value iteration, and compute $J^1$ using
(3.8). Verify that $J^1(x) = J^*(x, N)$ for each $x$. Repeat: write down the formula for $J^2$ using (3.8),
and verify that $J^2(x) = J^*(x, N-1)$ (recall (3.26)). This procedure leads to a proof by induction
that $J^{k+1}(x) = J^*(x, N-k)$ for each $k \le N$.

Now a mild generalization: for general $J^0$, show that the function $J^n$ obtained from the VIA
algorithm (3.8) satisfies (3.9) (and observe the similarity with (3.4a)). The proof of (3.9) is an
instance of the principle of optimality illustrated in Fig. 3.1.


Continuous time:
3.6 Consider the double integrator $\ddot{y} = u$. Perform the following calculations by hand:
(a) Obtain a state space model with $x_t = (y_t, \dot{y}_t)^T$

(b) Compute the value function

$$J^*(x) = \min_u \int_0^\infty \bigl( y_t^2 + u_t^2 \bigr)\, dt, \qquad x(0) = x$$

3.7 Comparing LQR in continuous and discrete time Let’s revisit Exercise 3.1, but in its original 
continuous time setting: 


—0.2 1 1 
dg = _ 
ae = Ax+ Bu, A A Z| , B A 


Compute the value function and the optimal policy with c(x,u) = 2? + u?. How does the policy 
compare with what was obtained in Exercise 3.1? Are the value functions related (approximately) 
in any way? 


3.8 Consider the variant of eq. (3.63) in continuous time:

$$\tfrac{d}{dt} x = f(x,u) = -u, \qquad c(x,u) = x^p + R u^q \qquad (3.64)$$

Estimate the following Bellman error (3.31):

$$\mathcal{B}(x) := \min_u [c(x,u) + \nabla J(x) \cdot f(x,u)]$$

where $J(x) = S x^s$, with $S > 0$ and $s > 1$. What values of $s, S$ do you recommend? (provide an
answer for arbitrary $p, q, R$).


3.9 Optimization of the rowing game. We now return to Exercise 2.8, and design (Kp, K7, Ky) 
in (2.69) using LQR. Rather than consider the full 3N-dimensional system, we pretend that the 
signal z defined in (2.67) is entirely exogenous (rather than a function of the state). 


(a) Obtain a state space model under this idealization, with augmented state 2 = (z',v', z!"), 
and Z treated as an exogenous input: 


fo% = A'e™ + Bul + Ez with Abe RR? BE eR’. 


For this to approximate the actual dynamics, the constant v* will depend upon the feedback gains 
that determine u’. 


(b) Obtain (Kj, K7, K;) using LQR based on this (A’, B’) and with quadratic cost 
e' (2%, u) = ch(x*) + Ru? 


(a2 a) = [z* = Zz)" he Sy[v _ alte 
in which S,,R > 0. Note that you may calculate the gains using vf = 0, but your policy uses
non-zero values: 


ub = —K*(2" — z) — Kf2"* — KX(v' — vo) (3.65) 


(c) We now stop pretending: analyze the closed loop system in which each sculler uses (3.65), 
with z = % the average position. The N-dimensional vector input (with ith component (3.65)) has 
the form 

u=—Ka+ K®%v™ with K, K® € R?3% 


Compute the value function J: R? — R, associated with your policy, and compare it to J* using 
full state feedback (both value functions are defined with c(x,u) = >>, c'(2™,u’)). Note that each 
value function is a quadratic function of the full 3N-dimensional state, so your comparison is in 
terms of the respective matrices. One should be larger than the other. What is the “loss” from 
using limited information? You might consider 


1 
Loss = —————~trace (M — M* 
trace(M*) ( ) 
with J(x) = z™Max when v™ = 0. 
Or, propose your own definition of loss and provide justification. 


(d) Repeat (c) for several values of N, and in each case obtain the three eigenvalues in closed 
loop for your three dimensional model. Compare these to the eigenvalues for the true closed loop 
system, which takes the form 


fr= (A— BK)x+Ev™, with B, E € RN, 


and A, B, E obtained from part (a). The theory of mean field games predicts stability and approximate
optimality of (3.65) when $N$ is large [82].
Note: if $(A - BK)$ has eigenvalues in the right half plane, you may need to reconsider your choice
of $R$ or $S_v$.

3.10 Energy control design for the inverted pendulum [14]. We return to the control system de- 
scribed in Exercise 2.17. Once you have completed parts (a) and (b), solve the following. 


Unnormalized kinetic and potential energy are functions of the state: 
KE(z) = 4(6)? PE(x) = cos(@) — 1 


where the potential energy is relative to 8 = 0, where it is maximal. 


In the seminal paper of Astrém and Furuta, the control-Lyapunov function approach is applied 
using total energy: 

E(x) = KE(xz) + s,PE(z) 
with s, > 0. You can look at the paper, but I suggest you discover for yourself what value will 
result in a stabilizing feedback law. 


The goal is to steer energy to some target value $E_0$. We will take $E_0 = 0$, which corresponds to the
upright position (as in the paper, we do not include friction in the model). You may pretend that
g = 1 in your simulations.
(c) To obtain a control-Lyapunov function consistent with our goals, choose
$J(x) = \tfrac12[E(x) - E_0]^2 = \tfrac12[\tfrac12 x_2^2 + s_p(\cos(x_1) - 1)]^2$    (3.66)
Define a feedback law $\phi$ and cost function c as follows (with R > 0 fixed):
$\phi(x) = \arg\min_u \{\tfrac12 R u^2 + f(x,u)\cdot\nabla J(x)\}$
$c(x,u) = \tfrac12 R u^2 - \min_v \{\tfrac12 R v^2 + f(x,v)\cdot\nabla J(x)\}$


Verify that J solves the HJB equation with this c, and choose parameters for J in (3.66) so that c
is non-negative, with c(x,u) > 0 whenever $x_2^2 + u^2 > 0$, $x_1 \ne \pm\pi/2$, and $E(x) \ne 0$.


(d) Try out your control design from (c). Obtain plots as a function of time (including state-input
trajectories, and $E(x_t)$). Include these scenarios:


• An initial condition $x_0$ satisfying $J(x_0) = 0$ but $x_0 \ne 0$ (you may observe the numerical
instability I mentioned in lecture).

• The initial condition $x_0 = (\pi, 0)^\top$, with two or more values of R. In the case of large R, does
the pendulum require many swings to reach its target?


3.11 In Exercise 3.10 you learned how to control a pendulum to obtain $E(x_t) \to 0$ as $t \to \infty$. You
likely saw the pendulum swing around, periodically reaching $x_t \approx 0$, and then flying away from
this desired position.


In this exercise you will attempt to catch the pendulum and more carefully steer it to the upright 
position. You will consider the normalized dynamics: 


0 = 6 — sin(0) + wcos(8) That is, m = g = ¢ (for convenience). 
(a) For x = (0,0)' ¥ 0, a Taylor series expansion gives 
0x O0-O+u 


Based on this linear approximation, obtain a linear state space model, and from this a feedback
law $u = -K^* x$ based on LQR with cost $c(x,u) = \|x\|^2 + r u^2$ (you may need to experiment with r
to obtain a good design).


(b) Given a threshold tg > 0, consider the following policy: 


—K*zx,  ||x4|| < Te 
Ut = E 
ob” (xz) else 


where ” is the policy you obtained in the previous assignment. Global asymptotic stability may 
require a weighted norm, such as 


$\|x\|^2 = J^*(x) = x^\top M^* x$, where $M^*$ solves the ARE (continuous time version).


Conjecture conditions under which this policy results in a closed loop system that is globally 
asymptotically stable on the restricted state space $X = \{x \in \mathbb{R}^2 : -2 < E(x) < 1\}$ (this includes
all states of the form $x = (\theta, 0)^\top$ with $|\theta| < \pi$).


(c) Simulate! Obtain a successful control design, and provide plots and discussion. 


Note that the definition of $u_t$ doesn’t make much sense unless you convert to physical units.
In Matlab use theta = wrapToPi(theta) (so that $|\theta| \le \pi$).
3.12 Solve an adaptation of Exercise 3.10 for the Mountain Car, which requires the following
modifications: 


(i) The state space model (2.56) 
(ii) E(x) = KE(x) + PE(x), where kinetic and potential energy are 


KE(z) =4mv?_— PE(x) =U(z) -U(2"), w= (z,v) EX (3.67) 


with $U$ defined in (2.57). Obtain a formula for $\frac{d}{dt}E(x_t)$, and based on this propose a control
law (perhaps based on a control-Lyapunov function design) that will drive $E(x_t)$ to zero. For
simulations, use the numerical values given in Section 2.7.2.


3.13 Obtain an energy based policy (along with an adaptation of Exercise 3.10) for MagBall, based
on the description in Section 2.7.3.


Some ideas to get you started: The kinetic energy of the ball as a function of state is $\tfrac12 m v^2$, with
$v = x_2$. We next introduce potential energy, following the arguments used to justify the two plots
in Fig. 2.11. The right hand side of (2.60b) was obtained from Newton’s law, with force 


$F(y,u) = mg - k(u/y)^2$


Integrate the force with fixed u = u° to obtain the potential energy associated with this static 
input: 


$U(y) = -\int^{y} F(z, u^\circ)\, dz$


This function is strictly concave, with maximum at y = r, where evidently $U'(r) = 0$. The total
energy is then defined by
$E(x) = \tfrac12 m x_2^2 + U(x_1)$


It is easily established that $\nabla E(x) = (-F(x_1, u^\circ),\; m x_2)^\top$.


3.11 Notes 


There are many excellent books on theory and history of nonlinear optimal control, such as [386, 45]. 
Inverse dynamic programming and the control-Lyapunov function approach have a long history in
the control theory literature [121, 369, 283]. They also lurk behind the curtains in the RL literature:
as will be seen in Chapter 5, minimization of the Bellman error is a common goal in RL that is closely
aligned with the IDP approach to control.

This chapter has provided only a brief survey of the LQR problem. See the books [205, 81, 7] for 
much more theory and history, the December 1971 special issue of the IEEE Transactions on Auto- 
matic Control, and early work of Kalman [175, 176]. If you have the Control System Toolbox from 
Matlab, then you have available some brilliant tools for computation: 


lqr, lqrd: Solves the LQR problem using the data A, B, Q, and R. 
are, ared: Solves the algebraic Riccati equation in continuous or discrete time 


conv, rlocus: used to graph the “symmetric root locus” [7, 77] (worth knowing about, but 
this book is not the best reference!) 
See [126] for a recent control-theoretic approach to accelerate value iteration.

The linear programming approach to dynamic programming goes back to Manne in the 60’s 
[238, 106, 10, 64]. There is an on-going research program on LP approaches to optimal control for
deterministic systems [367, 160, 161, 132, 215, 177, 68, 139, 140, 69, 349], and semidefinite programs 
(SDPs) in linear optimal control [72, 369]. 

Fig. 3.4 is taken from [42] to illustrate the potential complexity of control problems with large 
input space, and how the complexity can be reduced via a clever choice of model. The example is 
posed in a stochastic setting, so of course the fly is not frozen in place. 

The network shown in Fig. 3.5 is called the Kumar-Seidman-Rybko-Stolyar model in [254]: the 
potential instability of this model was discovered in [203], which led to further investigation in 
[308, 100], followed by a burst of interest in stability and control theory for stochastic networks.
See [254] for more history. 

The book is entirely focused on the standard performance objectives: total cost,
discounted cost, and average cost. This leaves out objectives such as variance-penalized objectives,
risk-sensitive control [374], or more exotic functions of the occupancy pmf introduced in Section 5.7. 
In such cases the objective in (5.82a) is replaced with C(@) in which C is convex. See [152, 383] for 
treatments of this more general setting. 


Pre-publication draft -- March 25, 2022 


Chapter 4 


ODE Methods for Algorithm Design 


For our purposes, an algorithm is a finite sequence of computer-implementable instructions, de- 
signed to compute or approximate a policy, its performance, a value function, or related quantities. 
In algorithm design it is useful to throw away the constraints of computers, and pretend that they 
can operate with infinite clock-speed. An ordinary differential equation (ODE) will be regarded as 
an example of an algorithm operating on this imaginary computer. 

The motivation comes from two sources. First, we want to know if our algorithm will eventually 
lead to a good approximation. This is easily couched in the theory of stability of ODEs, for which 
there is a far richer theory than stability of recursions in discrete time. Secondly, once we have 
constructed an ODE with desirable properties, including stability, we can then get advice from 
experts to provide the translation from calculus to a practical recursive algorithm. In this chapter 
the translation step is performed using an Euler approximation, but it is hoped that the reader 
will experiment with more efficient ODE solvers based on an evolving theory of numerical methods 
[80]. 

Throughout the remainder of the book we apply the following steps in algorithm design: 


ODE Method. 
1. Formulate the algorithmic goal as the root finding problem,
$f(\theta^*) = 0$, where $f\colon \mathbb{R}^d \to \mathbb{R}^d$.


2. Refine the design of f, if necessary, to ensure that the associated ODE is globally
asymptotically stable:

$\frac{d}{dt}\vartheta = f(\vartheta)$    (4.1a)


3. Is an Euler approximation appropriate?
$\theta_{n+1} = \theta_n + \alpha_{n+1} f(\theta_n)$    (4.1b)


See Section 4.1 for conditions under which (4.1b) is a good approximation.
4. Design an algorithm to approximate (4.1b) based on whatever observations are available.


4.1 Ordinary Differential Equations


Let’s start with a question we should probably have posed earlier: what is an ODE? The question 
was taken for granted many times in the preceding pages. Up to now, the state space model in 
continuous time considered in Section 2.4.5 is the most significant example. 

In this chapter, the “state variable” often represents the output of an algorithm rather than
anything directly related to a control system. For this reason, throughout this chapter we use
$\vartheta = \{\vartheta_t : t \ge 0\}$ to denote the state process for the ODE, and restrict to the Euclidean setting:
$\vartheta_t \in \mathbb{R}^d$ for an integer d. The state space model (2.36) in this notation is


$\frac{d}{dt}\vartheta = f(\vartheta)$,    $\vartheta_0 = \theta_0$ given,    (4.2)
where $f\colon \mathbb{R}^d \to \mathbb{R}^d$ is the vector field as in (2.36). Two examples with d = 1:
$f(\theta) = a\theta$ and $\vartheta_t = \theta_0 e^{at}$
$f(\theta) = \theta^2$ and $\vartheta_t = [\theta_0^{-1} - t]^{-1}$


The ODE (4.2) is called time-homogeneous since f does not depend upon t. See Section 3.3 for 
hints on how to apply state augmentation to create homogeneity for a model in which the vector 
field f depends on time. 

In every application of the ODE approach to algorithm design, the first step is to construct the
vector field so that $\vartheta_t$ converges to some desired value $\theta^* \in \mathbb{R}^d$. In particular, if $\vartheta_0 = \theta^*$, then the
solution to the ODE should stay put: $\vartheta_t = \theta^*$ for all $t \ge 0$. This requires that $\frac{d}{dt}\vartheta_t = 0$ for all t,
which by (4.2) implies that $\theta^*$ is an equilibrium: $f(\theta^*) = 0$. Advanced material on stability theory
for ODEs is contained in Section 4.8. Some of this can be anticipated from the Lyapunov theory
contained in Section 2.4.5.

Understanding theory surrounding existence of solutions of (4.2) is the first step towards understanding
ODE principles for algorithm design in this chapter and in Part 2 of the book. Much as
how the Bellman equation (3.5) is regarded as a fixed point equation, the ODE (4.2) is a fixed point
equation in the variable $\vartheta = \{\vartheta_t : t \ge 0\}$. Perhaps we can mirror the success of the value iteration
algorithm (3.8) (an instance of successive approximation)? Writing (4.2) as $\vartheta = \vartheta - \frac{d}{dt}\vartheta + f(\vartheta)$, an
analog would be

$\vartheta^{n+1}_t = \vartheta^n_t - \frac{d}{dt}\vartheta^n_t + f(\vartheta^n_t)$,    $t \ge 0$, $n \ge 0$


with $\vartheta^0 = \{\vartheta^0_t : t \ge 0\}$ given as initial condition. Sorry to say, this approach is doomed to failure!
One source of difficulty is the repeated differentiation in this recursion, which means we have to
be very careful with our selection of $\vartheta^0$. Also, this recursion does not respect the requirement that
the initial condition $\theta_0$ is specified.

The Fundamental Theorem of Calculus motivates a more sensible approach. That is,


$\vartheta_t = \theta_0 + \int_0^t f(\vartheta_\tau)\, d\tau$,    $0 \le t \le T$    (4.3)


where the finite time horizon T is chosen for the sake of analysis. Successive approximation is
defined as before: take an initial guess $\vartheta^0 = \{\vartheta^0_t : 0 \le t \le T\}$, and define for $n \ge 0$,


$\vartheta^{n+1}_t = \theta_0 + \int_0^t f(\vartheta^n_\tau)\, d\tau$,    $0 \le t \le T$    (4.4)


This recursion is known as Picard iteration. It is successful under mild assumptions:
Proposition 4.1. Suppose that the function f is globally Lipschitz continuous: there is L > 0
such that for each $x, y \in \mathbb{R}^d$,

$\|f(x) - f(y)\| \le L\|x - y\|$    (4.5)
Then for each $\theta_0$ there exists a unique solution to (4.3) on the infinite time horizon. Moreover
successive approximation is uniformly convergent:


$\lim_{n\to\infty} \max_{0\le t\le T} \|\vartheta^n_t - \vartheta_t\| = 0$    □
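
The successive approximation above can be tried numerically. The following is a minimal sketch, not taken from the text: it assumes the scalar vector field f(θ) = aθ, a uniform time grid, and trapezoidal quadrature in place of the exact integral. The function name `picard` and all numerical values are hypothetical choices for illustration.

```python
import math

def picard(f, theta0, T=1.0, grid=200, iters=25):
    """Picard iteration on a uniform grid; the integral is approximated
    by the trapezoidal rule (an assumption made for this sketch)."""
    h = T / grid
    path = [theta0] * (grid + 1)          # initial guess: a constant path
    for _ in range(iters):
        vals = [f(x) for x in path]       # f evaluated along the current guess
        new_path = [theta0]
        integral = 0.0
        for i in range(1, grid + 1):
            integral += 0.5 * h * (vals[i - 1] + vals[i])
            new_path.append(theta0 + integral)
        path = new_path
    return path

a, theta0, T = -1.0, 2.0, 1.0
path = picard(lambda x: a * x, theta0, T)
exact = theta0 * math.exp(a * T)          # closed-form solution at time T
err = abs(path[-1] - exact)
print(f"Picard estimate {path[-1]:.6f}, exact {exact:.6f}")
```

The remaining error comes from the quadrature rule rather than from the iteration itself, so refining the grid is what improves the estimate once enough iterations have been run.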


A key component of the proof of Prop. 4.1 is Grönwall’s Inequality, which commonly appears
in the theory of stochastic approximation, as well as ordinary differential equations. Note that
Bellman had early influence here [34], which is why Prop. 4.2 is often called the Bellman-Grönwall
Lemma.


Proposition 4.2. (Grönwall Inequality) Let $\alpha$, $\beta$ and $z$ be non-negative functions defined
on an interval [0,T], with T > 0. Assume that $\beta$ and $z$ are continuous, and the integral inequality
holds:


$z_t \le \alpha_t + \int_0^t \beta_s z_s\, ds$,    $0 \le t \le T$    (4.6a)
(i) The Grönwall Inequality holds:
$z_t \le \alpha_t + \int_0^t \alpha_s \beta_s \exp\bigl(\int_s^t \beta_r\, dr\bigr)\, ds$,    $0 \le t \le T$.    (4.6b)
(ii) If, in addition, the function $\alpha$ is non-decreasing, then
$z_t \le \alpha_t \exp\bigl(\int_0^t \beta_s\, ds\bigr)$,    $0 \le t \le T$.    (4.6c)
□


The proof can be found in Section 4.8, or if you have background in linear state space models,
you might want to work it out on your own. Hint: first solve the problem with equality:


$z_t = \alpha_t + \int_0^t \beta_s z_s\, ds$    (4.7)


You can construct a state space model, with state $x_t = z_t - \alpha_t$, and because it is a scalar linear
system you obtain an explicit solution. The solution leads to something like (4.6b), but with
equality.

The following simple lemma is also frequently required in ODE analysis. 


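Lemma 4.3. Suppose that $\{\gamma_t : t \in \mathbb{R}\}$ is a non-negative function satisfying the following: (i)
$|\gamma_t - \gamma_s| \le L|t - s|$ for a constant L and all t, s, and (ii) $\int_0^\infty \gamma_t\, dt < \infty$. Then, $\lim_{t\to\infty} \gamma_t = 0$.


Proof. Assumption (i) implies that for any $t_0$ we have the bound $\gamma_t \ge \gamma_{t_0} - L(t - t_0)$ for
$t_0 \le t \le t_0 + \gamma_{t_0}/L$. This combined with (ii) then gives
$\frac{1}{2L}\lim_{t_0\to\infty} \gamma_{t_0}^2 = \lim_{t_0\to\infty} \int_{t_0}^{t_0+\gamma_{t_0}/L} \bigl(\gamma_{t_0} - L(t - t_0)\bigr)\, dt \le \lim_{t_0\to\infty} \int_{t_0}^{\infty} \gamma_t\, dt = 0$    □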
4.2 A Brief Return to Reality


This entire chapter considers ODE approaches to algorithm design: this means design of the func- 
tion f appearing in (4.2), or design of the more exotic ‘quasi-stochastic’ ODEs for which theory and 
applications are developed in Sections 4.5 to 4.7. 

There is the inevitable translation step: any design formulated in continuous time must be
translated to create a practical algorithm. If you have taken a first year calculus course, then you
probably have predicted the most common approach: select a sequence of times $\{0 = t_0 < t_1 < \cdots\}$,
and replace the derivative in (4.2) by a finite difference: with $\hat\vartheta_0 = \theta_0$ given, define for each $n \ge 0$,


$\alpha_{n+1}^{-1}\bigl[\hat\vartheta_{t_{n+1}} - \hat\vartheta_{t_n}\bigr] = f(\hat\vartheta_{t_n})$

where $\alpha_{n+1} = t_{n+1} - t_n > 0$. The recursive nature is evident after rearranging terms:
$\hat\vartheta_{t_{n+1}} = \hat\vartheta_{t_n} + \alpha_{n+1} f(\hat\vartheta_{t_n})$    (4.8)


In our final algorithm we simplify notation, writing $\theta_n = \hat\vartheta_{t_n}$, and $\{\alpha_n\}$ is known as the step-size
sequence. This is known as the Euler approximation of an ODE, or simply Euler’s method.

This approximation is successful under the assumptions of Prop. 4.1: it can be shown that

$\max_{n : t_n \le T} \|\vartheta_{t_n} - \theta_n\| \le K(L,T)\,\bar\alpha$    (4.9)

where $\bar\alpha = \max\{\alpha_n : t_n \le T\}$, and L was introduced in (4.5). Grönwall’s Inequality is used in the
proof of (4.9), which leads to upper bounds on K(L,T) that at first appear frightening (growing
exponentially fast in L and T).

Fortunately, asymptotic stability of the ODE often implies stability of (4.8), and in this case
we obtain the bound (4.9) with K(L,T) independent of T > 0. Thm. 4.9 is an important special
case for applications to optimization.


Euler approximation for a linear system. The Euler approximation for the LTI model in
continuous time (2.44), with f(x) = Ax, results in the discrete time model (2.42), with $F = (I + \alpha A)$
(with constant step-size $\alpha_n = \alpha > 0$).

Consider the scalar case $\frac{d}{dt}\vartheta = a\vartheta$, which admits the solution $\vartheta_t = \theta_0 e^{at}$. The Euler approximation
results in a similar solution, as a function of the initial condition:


$\hat\vartheta_{t_n} = F^n\theta_0$,    $F = (1 + \alpha a)$


The approximation $\hat\vartheta_{t_n} = \vartheta_{t_n} + O(\alpha)$ follows from the Taylor series approximation for the
exponential, $(1 + \alpha a)^n \approx (e^{\alpha a})^n = e^{a t_n}$.

If $a < 0$ and $\alpha < |a|^{-1}$, then the approximation holds on the infinite time interval, since both
$\hat\vartheta_{t_n}$ and $\vartheta_{t_n}$ converge to zero geometrically fast, as $n \to \infty$. This bound on the step-size $\alpha$ may be
regarded as a special case of a more general theory; see Thm. 4.9. □
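
The O(α) error in the scalar linear example can be checked directly. This is a small sketch under assumed values (a = -2, horizon T = 1, and three constant step-sizes, none of which come from the text); `euler_linear` and `max_err` are hypothetical helper names.

```python
import math

def euler_linear(a, theta0, alpha, n_steps):
    """Euler approximation for the scalar linear ODE: theta_n = F**n * theta0."""
    F = 1.0 + alpha * a
    return [theta0 * F**n for n in range(n_steps + 1)]

a, theta0, T = -2.0, 1.0, 1.0

def max_err(alpha):
    """Largest gap between the Euler iterates and the exact solution on [0, T]."""
    n = round(T / alpha)
    approx = euler_linear(a, theta0, alpha, n)
    return max(abs(approx[k] - theta0 * math.exp(a * k * alpha))
               for k in range(n + 1))

errs = {alpha: max_err(alpha) for alpha in (0.1, 0.05, 0.025)}
for alpha, e in errs.items():
    print(f"alpha = {alpha:<6} max error = {e:.5f}")
```

Halving the step-size roughly halves the maximal error, which is the first-order behavior predicted by the bound (4.9).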

Those interested in a high-fidelity approximation of an ODE usually abandon the Euler approximation
for more sophisticated techniques, such as the midpoint method or more general Runge-Kutta
methods [80, 168, 384]. The update equations are more complex, but this complexity is often
offset by the tighter approximation. However, remember that in this chapter our goal is to estimate a
stationary point: the solution $\theta^* \in \mathbb{R}^d$ to


$f(\theta^*) = 0$    (4.10)


and not to accurately track solutions to the ODE (4.2). The best way to perform an ODE approximation
for this relatively modest goal is an open field for research.
4.3 Newton-Raphson Flow


This section concerns an approach to ODE design, in which the goal is to solve the root finding
problem (4.10). The parameter $\theta^*$ is regarded as an equilibrium condition for the ODE (4.2). The
problem we face here is that this ODE may not be stable in any sense. In this section we describe
a general approach to modify the dynamics and ensure stability.

Section 4.4 concerns root finding problems for the special case in which $f = -\nabla_\theta\Gamma$, where
$\Gamma\colon \mathbb{R}^d \to \mathbb{R}_+$ is a loss function associated with some optimization problem. The root finding
problem is then equivalent to the first order condition for optimality, $\nabla_\theta\Gamma(\theta^*) = 0$. If $\Gamma$ has nice
properties (such as convexity), then it is not difficult to establish stability of (4.2) using Lyapunov
function techniques (such results are surveyed in Section 4.4). The techniques in this section may
prove useful in applications for which stability of (4.2) fails.

Our starting point is to regard $f(\vartheta_t)$ as the “state variable”, and our next step is to define
dynamics so that $\lim_{t\to\infty} f(\vartheta_t) = 0$ for each initial condition. Under mild additional assumptions
on f it will follow that $\lim_{t\to\infty} \vartheta_t = \theta^*$, which is our design objective.

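If $f(\vartheta_t)$ is a state variable, this means there is an associated vector field $V\colon \mathbb{R}^d \to \mathbb{R}^d$, with


$\frac{d}{dt} f(\vartheta_t) = V(f(\vartheta_t))$    (4.11)
One way to ensure that $f(\vartheta_t)$ converges to zero is to choose $V(f) = -f$, giving
$\frac{d}{dt} f(\vartheta_t) = -f(\vartheta_t)$    (4.12)


The solution is $f(\vartheta_t) = e^{-t} f(\vartheta_0)$, which converges to zero exponentially quickly. Achieving these
dynamics would be an amazing feat!
Well, it isn’t really so difficult. Apply the chain rule:


$\frac{d}{dt} f(\vartheta_t) = A(\vartheta_t)\,\frac{d}{dt}\vartheta_t$,    with $A(\theta) = \partial_\theta f(\theta)$,  $\theta \in \mathbb{R}^d$


where $\partial_\theta f$ denotes the d × d Jacobian matrix with entries


$A_{i,j}(\theta) = \frac{\partial}{\partial\theta_j} f_i(\theta)$    (4.13)


This means that achieving the dynamics (4.11) is equivalent to
$\frac{d}{dt}\vartheta_t = [A(\vartheta_t)]^{-1} V(f(\vartheta_t))$
Application of this identity for the special case $V(f) = -f$ results in a famous ODE:
Newton-Raphson Flow.
$\frac{d}{dt}\vartheta_t = -[A(\vartheta_t)]^{-1} f(\vartheta_t)$    (4.14a)
The function on the right hand side is the Newton-Raphson vector field


$\theta \mapsto -[A(\theta)]^{-1} f(\theta)$    (4.14b)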
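
The exponential decay of f along the Newton-Raphson flow can be observed in simulation. Below is a minimal sketch using an Euler approximation of the flow for a scalar example; the function f(θ) = θ³ + θ - 2 (root at θ* = 1), the step-size, and the helper names are all assumptions chosen for illustration, not taken from the text.

```python
import math

def f(theta):
    return theta**3 + theta - 2.0         # hypothetical example with root theta* = 1

def A(theta):
    return 3.0 * theta**2 + 1.0           # derivative of f; strictly positive

def newton_raphson_flow(theta0, alpha=1e-3, t_final=3.0):
    """Euler approximation of d/dt theta = -f(theta) / A(theta)."""
    theta = theta0
    for _ in range(round(t_final / alpha)):
        theta -= alpha * f(theta) / A(theta)
    return theta

theta0 = 3.0
theta_T = newton_raphson_flow(theta0)
# Along the exact flow, f(theta_t) = exp(-t) f(theta_0); compare at t = 3.
ratio = f(theta_T) / (math.exp(-3.0) * f(theta0))
print(f"theta at t=3: {theta_T:.4f}, f ratio vs exp(-t) prediction: {ratio:.4f}")
```

A ratio close to one indicates that the discretized flow tracks the prediction f(ϑ_t) = e^{-t} f(ϑ_0); taking the step-size equal to one instead recovers the classical Newton-Raphson iteration.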
In most applications it is not possible to determine a-priori if the matrix $A(\theta) = \partial_\theta f(\theta)$ is full
rank, which motivates a regularized Newton-Raphson flow: for fixed $\varepsilon > 0$,


$\frac{d}{dt}\vartheta_t = G(\vartheta_t) f(\vartheta_t)$    (4.15)


where $G(\theta) = -[\varepsilon I + A(\theta)^\top A(\theta)]^{-1} A(\theta)^\top$ is an approximation of the pseudo-inverse of $-A(\theta)$. It is
shown in Prop. 4.4 that solutions to (4.15) are bounded in time, provided $V = \|f\|^2$ is a coercive
function on $\mathbb{R}^d$. Mild additional conditions imply that V serves as a Lyapunov function for (4.15),
giving


$\lim_{t\to\infty} f(\vartheta_t) = 0$    (4.16)
Proposition 4.4. Consider the following conditions for the function f:


(a) f is globally Lipschitz continuous and continuously differentiable. Hence $A(\cdot)$ is a bounded
and continuous matrix-valued function.


(b) $\|f\|$ is coercive. That is, the set $\{\theta : \|f(\theta)\| \le n\}$ is bounded for each n.


(c) The function f has a unique root $\theta^*$, and $A^\top(\theta) f(\theta) \ne 0$ for $\theta \ne \theta^*$. Moreover, the matrix
$A^* = A(\theta^*)$ is non-singular.


The following hold for solutions to the ODE (4.15) under increasingly stronger assumptions:
(i) If (a) holds then for each t, and each initial condition
$\frac{d}{dt} f(\vartheta_t) = -A(\vartheta_t)[\varepsilon I + A(\vartheta_t)^\top A(\vartheta_t)]^{-1} A(\vartheta_t)^\top f(\vartheta_t)$    (4.17)


(ii) If in addition (b) holds, then the solutions to the ODE are bounded, and


$\lim_{t\to\infty} A(\vartheta_t)^\top f(\vartheta_t) = 0$    (4.18)
(iii) If (a)-(c) hold, then (4.15) is globally asymptotically stable. □


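Proof. The result (i) follows from the chain rule and the definitions.
The proof of (ii) is based on the Lyapunov function $V(\theta) = \tfrac12\|f(\theta)\|^2$ combined with (a):


$\frac{d}{dt} V(\vartheta_t) = -f(\vartheta_t)^\top A(\vartheta_t)[\varepsilon I + A(\vartheta_t)^\top A(\vartheta_t)]^{-1} A(\vartheta_t)^\top f(\vartheta_t)$
The right hand side is non-positive when $\vartheta_t \ne \theta^*$. Integrating each side gives for any T > 0,
$V(\vartheta_T) = V(\vartheta_0) - \int_0^T f(\vartheta_t)^\top A(\vartheta_t)[\varepsilon I + A(\vartheta_t)^\top A(\vartheta_t)]^{-1} A(\vartheta_t)^\top f(\vartheta_t)\, dt$    (4.19)


so that $V(\vartheta_T) \le V(\vartheta_0)$ for all T. Under the coercive assumption, it follows that solutions to (4.15)
are bounded. Also, letting $T \to \infty$, we obtain from (4.19) the bound


$\int_0^\infty f(\vartheta_t)^\top A(\vartheta_t)[\varepsilon I + A(\vartheta_t)^\top A(\vartheta_t)]^{-1} A(\vartheta_t)^\top f(\vartheta_t)\, dt \le V(\vartheta_0)$

Boundedness of $\vartheta_t$ and continuity of A imply the existence of $B(\vartheta_0) < \infty$ such that
$\int_0^\infty \gamma_t\, dt \le B(\vartheta_0) < \infty$,    $\gamma_t = \|A(\vartheta_t)^\top f(\vartheta_t)\|^2$


The functions f and A are Lipschitz continuous, and boundedness then implies that $\gamma_t$ satisfies the
Lipschitz condition in Lemma 4.3. Hence $\lim_{t\to\infty} A(\vartheta_t)^\top f(\vartheta_t) = 0$ as claimed.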

We next prove (iii). While this follows from Prop. 2.5, we have the ingredients for a short
alternative proof that hopefully also offers additional insight.

Global asymptotic stability of (4.15) requires that solutions converge to $\theta^*$ from each initial
condition, and also that $\theta^*$ is stable in the sense of Lyapunov. Assumption (c) combined with (ii)
gives the former, that $\lim_{t\to\infty} \vartheta_t = \theta^*$. A convenient sufficient condition for the latter is obtained
by considering $A_\varepsilon = \partial_\theta[G(\theta) f(\theta)]\,|_{\theta=\theta^*}$. Stability in the sense of Lyapunov holds if this matrix is
Hurwitz (all eigenvalues are in the strict left half plane in $\mathbb{C}$) [181, Thm. 4.7].

Applying the definitions, we obtain $A_\varepsilon = -[\varepsilon I + M]^{-1} M$ with $M = A(\theta^*)^\top A(\theta^*) > 0$ (recall
that $A(\theta^*)$ is assumed to be non-singular). The eigenvectors of $A_\varepsilon$ coincide with those of M, and
for each eigenvector-eigenvalue pair $(v, \lambda)$ for M we have


$A_\varepsilon v = \lambda_\varepsilon v$,    $\lambda_\varepsilon = -\frac{\lambda}{\varepsilon + \lambda}$


This establishes that $A_\varepsilon$ is Hurwitz. □


4.4 Optimization 


Here we turn to minimization of a loss function $\Gamma\colon \mathbb{R}^d \to \mathbb{R}_+$, for which we would like to compute
a global minimum
$\theta^* \in \arg\min_\theta \Gamma(\theta)$


This section contains a very brief survey of optimization theory, and ODE techniques to estimate
$\theta^*$. In particular, we establish conditions under which the steepest descent ODE is convergent:


$\frac{d}{dt}\vartheta = -\nabla_\theta\Gamma(\vartheta)$    (4.20)


This is also known as the gradient flow.

Recall that the gradient $\nabla\Gamma(\theta_0)$ at a vector $\theta_0 \in \mathbb{R}^d$ is orthogonal to the level set
$\{\theta \in \mathbb{R}^d : \Gamma(\theta) = \Gamma(\theta_0)\}$, in the sense illustrated in Fig. 4.1. For a given time $t_0$, denote $\theta_0 = \vartheta_{t_0}$, $r_0 = \Gamma(\theta_0)$,
and recall from (2.28) the definition of the sublevel set:


$S_\Gamma(r_0) = \{\theta \in \mathbb{R}^d : \Gamma(\theta) \le r_0\}$


If the gradient at $\theta_0$ is non-zero, as in the example shown, then the gradient flow drives the solution
into the interior of this set: $\Gamma(\vartheta_t) < r_0$ for all $t > t_0$. In particular, each sublevel set is absorbing:
once a solution to the gradient flow enters the set $S_\Gamma(r_0)$ it can never exit.

This intuition is the starting point towards finding conditions under which the gradient flow is
convergent using Lyapunov function techniques. Two typical choices for a Lyapunov function are


$V(\theta) = \tfrac12\|\theta - \theta^*\|^2$    or    $V(\theta) = \tfrac12[\Gamma(\theta) - \Gamma^*]$    (4.21)
Figure 4.1: Gradient flow: $\Gamma(\vartheta_t)$ is non-increasing because each sublevel set $S_\Gamma(r_0)$ is absorbing.


where $\Gamma^* = \min_\theta \Gamma(\theta)$. The latter is most natural, based on the intuition obtained from Fig. 4.1
and the similarity between this figure and Fig. 2.6.

The Polyak-Łojasiewicz (PL) condition considered in Section 4.4.2 is one minimal criterion for
convergence. When we are ready to approximate the gradient flow using an Euler approximation,
it is standard to impose the following L-smooth condition: for some L > 0, and all $\theta, \theta'$,


$\Gamma(\theta') \le \Gamma(\theta) + [\theta' - \theta]^\top\nabla\Gamma(\theta) + \tfrac12 L\|\theta' - \theta\|^2$    (4.22)


The usefulness of this bound is explained in Section 4.4.3.

Please do not feel you have to prove a theorem before you can experiment: the algorithms we 
obtain are frequently successful in practice, even when our assumptions are violated. For example, 
the optimization problems arising in training neural networks do not satisfy any of the assumptions 
presented here, but practitioners commonly apply the gradient descent algorithm described in 
Section 4.4.3, which is nothing but an Euler approximation of the gradient flow. 


4.4.1 Role of convexity 
The term has come up in casual discussion in preceding chapters, but now requires a formal definition:


Convexity.


▶ A set $S \subset \mathbb{R}^d$ is convex if it contains all line segments with endpoints in S. That is,
$(1-a)\theta^0 + a\theta^1 \in S$ for $\theta^0, \theta^1 \in S$, and any $a \in (0,1)$.


▶ A function $\Gamma\colon S \to \mathbb{R}$, with convex domain S, is called convex if the following bound
holds: for any pair $\theta^0, \theta^1 \in S$ and $\rho \in (0,1)$:


$\Gamma((1-\rho)\theta^0 + \rho\theta^1) \le (1-\rho)\Gamma(\theta^0) + \rho\Gamma(\theta^1)$    (4.23)
It is quasi-convex if the sublevel set $S_\Gamma(r)$ is convex (or empty) for any $r \in \mathbb{R}$.
A convex function is always quasi-convex. With $S = \mathbb{R}$, any continuous non-decreasing function
is quasi-convex, since in this case $S_\Gamma(r) = (-\infty, a(r)]$ for each r, where $a(r) = \max\{\theta : \Gamma(\theta) \le r\}$
(continuity isn’t required to ensure quasi-convexity, but is required to arrive at this representation
for the sublevel set).
The following characterization has a stronger geometric flavor:


Lemma 4.5. For any convex set S, a function $\Gamma\colon S \to \mathbb{R}$ is convex if and only if for each $\theta^0 \in \mathbb{R}^d$,
there is a vector $v^0 \in \mathbb{R}^d$ satisfying


$\Gamma(\theta) \ge \Gamma(\theta^0) + \langle v^0, \theta - \theta^0\rangle$,    for all $\theta \in S$    (4.24)


The right hand side of (4.24) is regarded as an affine function of $\theta$, and the bound means that the
graph of $\Gamma$ is always above the graph of the affine function. The vector $v^0$ is called a sub-gradient.
If $\Gamma$ is differentiable at $\theta^0$, then it is an ordinary gradient $v^0 = \nabla\Gamma(\theta^0)$.

There are also several stronger conditions:


▶ The function $\Gamma$ is strictly convex if the inequality (4.23) is strict whenever $\theta^1 \ne \theta^0$.
▶ If $\Gamma$ is differentiable, then it is called strongly convex if for a constant $\delta_0 > 0$,
$\langle\nabla\Gamma(\theta) - \nabla\Gamma(\theta^0), \theta - \theta^0\rangle \ge \delta_0\|\theta - \theta^0\|^2$    for all $\theta, \theta^0 \in \mathbb{R}^d$.    (4.25)


Strong convexity is used to establish nice numerical properties of the gradient flow. The value of
convexity and strict convexity is made clear in the following.


Proposition 4.6. Suppose that $\Gamma\colon \mathbb{R}^d \to \mathbb{R}_+$ is convex. Then, for given $\theta^0 \in \mathbb{R}^d$,
(i) If $\theta^0$ is a local minimum, then it is also a global minimum.
(ii) If $\Gamma$ is differentiable at $\theta^0$, with $\nabla\Gamma(\theta^0) = 0$, then $\theta^0$ is a global minimum.


(iii) If either of (i) or (ii) holds, and if $\Gamma$ is strictly convex, then $\theta^0$ is the unique global
minimum. □

The gradient flow is convergent under mild conditions. See Exercise 4.7 for generalization to 
matrix gain. 


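Proposition 4.7. Suppose that $\Gamma$ is continuously differentiable, convex and coercive, with unique
minimizer $\theta^*$. Then the gradient flow (4.20) is globally asymptotically stable, with unique equilibrium
$\theta^*$. If $\Gamma$ is strongly convex, then the rate of convergence is exponential:


$\|\vartheta_t - \theta^*\| \le e^{-\delta_0 t}\|\vartheta_0 - \theta^*\|$    (4.26)
where $\delta_0$ appears in (4.25).


Proof. We adopt the Lyapunov function approach, using $V(\theta) = \tfrac12\|\theta - \theta^*\|^2$. From the chain rule,
$\frac{d}{dt} V(\vartheta_t) = -\nabla_\theta\Gamma(\vartheta_t)^\top[\vartheta_t - \theta^*]$
Convexity implies that $\Gamma(\theta^*) \ge \Gamma(\vartheta_t) + \nabla_\theta\Gamma(\vartheta_t)^\top[\theta^* - \vartheta_t]$, giving
$\frac{d}{dt} V(\vartheta_t) \le -[\Gamma(\vartheta_t) - \Gamma(\theta^*)] \le 0$
where the inequality is strict when $\vartheta_t \ne \theta^*$. Prop. 2.5 then implies global asymptotic stability.
Under strong convexity we apply $\nabla_\theta\Gamma(\theta^*) = 0$ to obtain the stronger bound:
$\frac{d}{dt} V(\vartheta_t) = -\{\nabla_\theta\Gamma(\vartheta_t) - \nabla_\theta\Gamma(\theta^*)\}^\top[\vartheta_t - \theta^*] \le -\delta_0\|\vartheta_t - \theta^*\|^2$


That is, $\frac{d}{dt} V(\vartheta_t) \le -2\delta_0 V(\vartheta_t)$. This implies that $V(\vartheta_t) \le V(\vartheta_0)\exp(-2\delta_0 t)$ for any t, which
establishes (4.26). □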
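
The exponential bound for a strongly convex objective can be checked on a simple example. The sketch below is illustrative only: it assumes the quadratic loss Γ(θ) = ½(θ₁² + 4θ₂²), for which the minimizer is the origin and the strong convexity constant can be taken as 1, and it approximates the gradient flow with small Euler steps. The initial condition and step-size are hypothetical.

```python
import math

def grad(theta):
    # Gradient of the assumed loss 0.5 * (theta_1**2 + 4 * theta_2**2)
    return (theta[0], 4.0 * theta[1])

def norm(theta):
    return math.hypot(theta[0], theta[1])

delta0 = 1.0          # strong convexity constant for this particular quadratic
alpha = 0.01          # Euler step-size (assumed)
n_steps = 500

theta = (3.0, -2.0)
norm0 = norm(theta)
bound_ok = True
for n in range(1, n_steps + 1):
    g = grad(theta)
    theta = (theta[0] - alpha * g[0], theta[1] - alpha * g[1])
    t_n = alpha * n
    # Compare the distance to the minimizer with exp(-delta0 * t) * initial distance
    bound_ok = bound_ok and norm(theta) <= math.exp(-delta0 * t_n) * norm0 + 1e-12

print("exponential bound holds along the trajectory:", bound_ok)
```

For this example the Euler iterates themselves satisfy the bound, because 1 - x ≤ e^{-x}; for a general loss the comparison would hold only approximately for small step-sizes.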
4.4.2 Polyak-Łojasiewicz condition


The use of the Euclidean norm to define a Lyapunov function in Prop. 4.7 relied heavily on convexity.
An alternative is known as the Polyak-Łojasiewicz (PL) inequality: for some $\mu > 0$ and all $\theta$,


$\tfrac12\|\nabla\Gamma(\theta)\|^2 \ge \mu[\Gamma(\theta) - \Gamma^*]$    (4.27)
We do not assume here that $\Gamma$ is convex, or that there is a unique optimizer.
Theorem 4.8. If the PL inequality holds, then the gradient flow satisfies, for each initial $\vartheta_0$,
$\Gamma(\vartheta_t) - \Gamma^* \le e^{-\mu t}[\Gamma(\vartheta_0) - \Gamma^*]$
If in addition $\Gamma$ is coercive, then the solutions are bounded, and any limit point $\theta_\infty$ of $\{\vartheta_t\}$ is an
optimizer: $\Gamma(\theta_\infty) = \Gamma^*$.
Proof. We adopt the second option in (4.21) for a Lyapunov function:
$V(\theta) = \tfrac12[\Gamma(\theta) - \Gamma^*]$  $\Longrightarrow$  $\frac{d}{dt} V(\vartheta_t) = -\tfrac12\|\nabla_\theta\Gamma(\vartheta_t)\|^2 \le -\mu V(\vartheta_t)$


This implies $V(\vartheta_t) \le e^{-\mu t} V(\vartheta_0)$, which is the desired inequality since $\Gamma(\vartheta_t) - \Gamma^* = 2V(\vartheta_t)$.

If $\Gamma$ is coercive then trajectories of $\vartheta$ evolve in the compact set $S = \{\theta : V(\theta) \le V(\vartheta_0)\}$. If $\theta_\infty$
is a limit point, this means $\theta_\infty = \lim_{n\to\infty} \vartheta_{t_n}$, with $t_n \uparrow \infty$. Continuity of the loss function then
implies optimality: $\Gamma(\theta_\infty) = \lim_{n\to\infty} \Gamma(\vartheta_{t_n}) = \Gamma^*$. □


4.4.3 Euler approximation
The standard Euler approximation of the gradient flow (4.20) is known as steepest descent:
$\theta_{k+1} = \theta_k - \alpha\nabla\Gamma(\theta_k)$    (4.28)


To obtain convergence from each initial condition we suppose that the objective function is L-smooth
(recall (4.22)). This is one of two crucial bounds in the proof of the following cousin of
Thm. 4.8:


Theorem 4.9. Suppose that $\Gamma$ satisfies two bounds: the L-smooth inequality (4.22), and the PL
inequality (4.27). Then, provided $\alpha \le 1/L$, the following bound holds for (4.28):


$\Gamma(\theta_k) - \Gamma^* \le (1 - \alpha\mu)^k[\Gamma(\theta_0) - \Gamma^*]$


Proof. The L-smooth inequality applied to the gradient descent recursion (4.28) gives
$\Gamma(\theta_{k+1}) - \Gamma(\theta_k) \le [\theta_{k+1} - \theta_k]^\top\nabla\Gamma(\theta_k) + \tfrac12 L\|\theta_{k+1} - \theta_k\|^2 = (-\alpha + \tfrac12 L\alpha^2)\|\nabla\Gamma(\theta_k)\|^2$
The bound $-\alpha + \tfrac12 L\alpha^2 \le -\tfrac12\alpha$ holds when $\alpha \le 1/L$. This and the PL inequality (4.27) imply
$\Gamma(\theta_{k+1}) - \Gamma(\theta_k) \le -\tfrac12\alpha\|\nabla\Gamma(\theta_k)\|^2 \le -\alpha\mu[\Gamma(\theta_k) - \Gamma^*]$.
Re-arranging and subtracting $\Gamma^*$ from both sides gives


$\Gamma(\theta_{k+1}) - \Gamma^* \le (1 - \alpha\mu)[\Gamma(\theta_k) - \Gamma^*]$


Iterating this inequality gives the result. □
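
The geometric rate in the theorem can be checked numerically. The following sketch assumes a particular quadratic loss, Γ(θ) = ½(θ₁² + 10θ₂²), chosen because its constants are easy to read off: it satisfies the PL inequality with μ = 1, it is L-smooth with L = 10, and Γ* = 0. The initial condition is arbitrary and the variable names are hypothetical.

```python
def gamma(theta):
    # Assumed loss: PL constant mu = 1, smoothness constant L = 10, minimum value 0
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)

def grad_gamma(theta):
    return (theta[0], 10.0 * theta[1])

mu, L = 1.0, 10.0
alpha = 1.0 / L                      # largest step-size allowed by the theorem
theta = (4.0, -3.0)
gap0 = gamma(theta)                  # initial optimality gap (Gamma* = 0)

bound_holds = True
for k in range(1, 51):
    g = grad_gamma(theta)
    theta = (theta[0] - alpha * g[0], theta[1] - alpha * g[1])
    # Compare the optimality gap with the bound (1 - alpha * mu)**k * gap0
    bound_holds = bound_holds and gamma(theta) <= (1.0 - alpha * mu) ** k * gap0 + 1e-12

print("bound holds for 50 iterations:", bound_holds)
print("final optimality gap:", gamma(theta))
```

In this example the observed decay is faster than the bound, which is expected since the theorem gives a worst-case rate over all functions with the same constants.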
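In practice, the L-smooth inequality for $\Gamma$ is usually verified via a global Lipschitz condition on
its gradient:
$\|\nabla\Gamma(\theta') - \nabla\Gamma(\theta)\| \le L\|\theta' - \theta\|$    (4.29)


Lemma 4.10. Suppose that (4.29) holds for all $\theta, \theta' \in \Theta$, where $\Theta \subset \mathbb{R}^d$. We then have,
(i) $|\langle\nabla\Gamma(\theta') - \nabla\Gamma(\theta), \theta' - \theta\rangle| \le L\|\theta' - \theta\|^2$ for all $\theta, \theta' \in \Theta$.
(ii) If $\Theta$ is convex, then $\Gamma$ is L-smooth.
Proof. Part (i) is immediate from (4.29):
$|\langle\nabla\Gamma(\theta') - \nabla\Gamma(\theta), \theta' - \theta\rangle| \le \|\nabla\Gamma(\theta') - \nabla\Gamma(\theta)\|\,\|\theta' - \theta\| \le L\|\theta' - \theta\|^2$


To establish (ii), for $\theta, \theta' \in \Theta$ denote $\theta_t = \theta + t[\theta' - \theta]$ and $\xi_t = \Gamma(\theta_t)$. We have $\theta_t \in \Theta$ for
$0 \le t \le 1$ under the convexity assumption. The function $\xi$ is differentiable on this domain, with
$\frac{d}{dt}\xi_t = \langle\nabla\Gamma(\theta_t), \theta' - \theta\rangle$. Applying (i),


$\frac{d}{dt}\xi_t - \frac{d}{dt}\xi_t\big|_{t=0} = \langle\nabla\Gamma(\theta_t) - \nabla\Gamma(\theta), \theta' - \theta\rangle \le tL\|\theta' - \theta\|^2$
Integrating from t = 0 to t = 1 gives (4.22):
$\Gamma(\theta') = \xi_1 = \xi_0 + \int_0^1 \frac{d}{dt}\xi_t\, dt$
$\le \xi_0 + \frac{d}{dt}\xi_t\big|_{t=0} + \tfrac12 L\|\theta' - \theta\|^2$
$= \Gamma(\theta) + \langle\nabla\Gamma(\theta), \theta' - \theta\rangle + \tfrac12 L\|\theta' - \theta\|^2$


□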


4.4.4 Constrained optimization 


Consider the optimization problem with equality constraints:


$\Gamma^* = \min\ \Gamma(\theta)$
s.t. $g(\theta) = 0$    (4.30)


where $g\colon \mathbb{R}^d \to \mathbb{R}^m$ (so there are $m \ge 1$ constraints). The constraints are convex if and only if g is
an affine function of $\theta$: for an m × d matrix D and vector $d \in \mathbb{R}^m$,


$g(\theta) = D\theta + d$    (4.31)


One approach to the solution of these problems is through a Lagrangian relaxation, defined through
a sequence of steps. Introduce the Lagrangian $\mathcal{L}\colon \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}$:


$\mathcal{L}(\theta, \lambda) = \Gamma(\theta) + \lambda^\top g(\theta)$,    $\theta \in \mathbb{R}^d$,  $\lambda \in \mathbb{R}^m$    (4.32)
The so-called dual function is the minimum of the Lagrangian, with constraints removed:


$\varphi^*(\lambda) = \min_\theta \mathcal{L}(\theta, \lambda)$    (4.33)


The value $-\infty$ is possible: consider what happens when $\Gamma$ is a linear function of $\theta$.
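
As a worked illustration of the dual function, consider two objectives that are assumptions chosen for convenience (they do not appear in the text), each paired with the affine constraint g(θ) = Dθ + d. With the quadratic objective the Lagrangian is minimized at θ = -Dᵀλ, giving a concave quadratic dual function. With a linear objective cᵀθ, for a hypothetical vector c, the Lagrangian is affine in θ, so the minimum is -∞ unless the coefficient of θ vanishes:

```latex
\begin{align*}
\Gamma(\theta) = \tfrac12\|\theta\|^2: \quad
  & \nabla_\theta \mathcal{L}(\theta,\lambda) = \theta + D^\top\lambda = 0
    \;\Longrightarrow\; \theta = -D^\top\lambda, \\
  & \varphi^*(\lambda) = -\tfrac12\|D^\top\lambda\|^2 + \lambda^\top d, \\[4pt]
\Gamma(\theta) = c^\top\theta: \quad
  & \varphi^*(\lambda)
    = \min_\theta \bigl\{ (c + D^\top\lambda)^\top\theta + \lambda^\top d \bigr\}
    = \begin{cases}
        \lambda^\top d & \text{if } c + D^\top\lambda = 0, \\
        -\infty & \text{otherwise.}
      \end{cases}
\end{align*}
```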
Figure 4.2: Primal and Dual optimization problems.


For any $\lambda \in \mathbb{R}^m$ we obtain a lower bound on $\Gamma^*$ as follows:
$\varphi^*(\lambda) \le \min\{\mathcal{L}(\theta, \lambda) : g(\theta) = 0\} = \min\{\Gamma(\theta) : g(\theta) = 0\} = \Gamma^*$


where the inequality holds because we have re-introduced the constraints, which means we are
minimizing $\mathcal{L}$ over a potentially smaller set. The dual problem is defined to be the maximum of $\varphi^*$
over all $\lambda$:


$\max_\lambda \min_\theta \mathcal{L}(\theta, \lambda) = \max_\lambda \varphi^*(\lambda) \le \Gamma^*$    (4.34)


We say there is a duality gap if the inequality is strict. The left hand side is called a min-max (or
saddle point) problem. Fig. 4.2 shows a typical example of convex optimization without duality
gap.

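There is a simple ODE to obtain a solution to the saddle point problem (4.34), known as the primal-dual flow:

$$ \tfrac{d}{dt}\vartheta = -\nabla_\theta \mathcal{L}(\vartheta,\lambda) = -\nabla_\theta\Gamma(\vartheta) - [\partial_\theta g(\vartheta)]^\top \lambda \tag{4.35a} $$
$$ \tfrac{d}{dt}\lambda = \nabla_\lambda \mathcal{L}(\vartheta,\lambda) = g(\vartheta) \tag{4.35b} $$
where, under the affine assumption, $[\partial_\theta g(\vartheta)]^\top \lambda = D^\top \lambda$.
Proposition 4.11. Suppose that the following hold: (i) $\Gamma$ is strictly convex and coercive, and (ii) the function $g$ is affine, of the form (4.31), in which the $m \times d$ matrix $D$ has rank $m$. Then, the primal-dual flow converges to the unique solution $(\theta^*,\lambda^*)$ of the dual: $\mathcal{L}(\theta^*,\lambda^*) = \varphi^*(\lambda^*) = \Gamma^*$. □

The proof can be found in Section 4.8.3. The first step is to exploit convexity to show that $V(\vartheta,\lambda) = \tfrac{1}{2}\|\vartheta - \theta^*\|^2 + \tfrac{1}{2}\|\lambda - \lambda^*\|^2$ is a Lyapunov function. This part of the proof is very similar to the proof of Prop. 4.7.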
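
The primal-dual flow (4.35) can be explored numerically. The following is a minimal sketch, assuming a quadratic objective $\Gamma(\theta) = \tfrac12\theta^\top P\theta + q^\top\theta$ and a single affine constraint; the matrices `P`, `q`, `D`, `d`, the step `dt`, and the horizon `T` are illustrative choices that do not come from the text. The flow is approximated with Euler steps and compared with a direct solution of the first-order optimality conditions.

```python
import numpy as np

def primal_dual_flow(P, q, D, d, dt=1e-3, T=40.0):
    """Euler steps for (4.35) with Gamma(theta) = 0.5*theta'P theta + q'theta
    and affine constraint g(theta) = D theta + d (illustrative assumption)."""
    theta = np.zeros(P.shape[0])
    lam = np.zeros(D.shape[0])
    for _ in range(int(T / dt)):
        grad_gamma = P @ theta + q      # gradient of the quadratic objective
        g = D @ theta + d               # value of the constraint function
        # simultaneous update: descent in theta (4.35a), ascent in lambda (4.35b)
        theta, lam = theta - dt * (grad_gamma + D.T @ lam), lam + dt * g
    return theta, lam

P = np.array([[2.0, 0.0], [0.0, 1.0]])   # strictly convex, coercive objective
q = np.array([-1.0, -1.0])
D = np.array([[1.0, 1.0]])               # one constraint: theta_1 + theta_2 = 1
d = np.array([-1.0])

theta, lam = primal_dual_flow(P, q, D, d)

# Reference: solve the first-order conditions P theta + q + D'lam = 0, D theta + d = 0.
kkt = np.block([[P, D.T], [D, np.zeros((1, 1))]])
ref = np.linalg.solve(kkt, np.concatenate([-q, -d]))
print(theta, lam, ref)
```

In this example $D$ has rank one and the objective is strictly convex, so the assumptions of Prop. 4.11 hold and the two answers should agree up to the residual transient of the flow.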

The case of inequality constraints is considered next, where similar analysis can be applied. 


Inequality constraints. We again have a function $g\colon \mathbb{R}^d \to \mathbb{R}^m$ that defines the constraints, but replace equality with inequality in the primal:
$$ \Gamma^* = \min_\theta\ \Gamma(\theta) \quad \text{s.t. } g(\theta) \le 0 \tag{4.36} $$

If $g_i$ is a convex function for each $i$ (or simply quasi-convex), then the constraint region $S = \{\theta : g(\theta) \le 0\}$ is a convex set.
The Lagrangian and dual function $\varphi^*$ are defined exactly as before, but we must restrict to $\lambda \in \mathbb{R}^m_+$ to obtain the prior upper bound:

$$ \varphi^*(\lambda) \le \min\{\mathcal{L}(\theta,\lambda) : g(\theta) \le 0\} \le \min\{\Gamma(\theta) : g(\theta) \le 0\} = \Gamma^* $$

where the second inequality is based on the bound $\lambda^\top g(\theta) \le 0$, whenever $g(\theta) \le 0$ and $\lambda \ge 0$. The saddle point problem is defined as before:

$$ \max_{\lambda \ge 0} \min_\theta \mathcal{L}(\theta,\lambda) = \max_{\lambda \ge 0} \varphi^*(\lambda) \le \Gamma^* \tag{4.37} $$
Subject to convexity and minor additional assumptions, there is no duality gap (the inequality is 
replaced with equality). 

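A pure steepest ascent algorithm to compute $\lambda^*$ would be of the form $\tfrac{d}{dt}\lambda_t = \nabla\varphi^*(\lambda_t)$. In the case of inequality constraints considered here, we must include a reflection process to ensure that $\lambda_t(i) \ge 0$ for each $i$ (details to follow shortly).

A representation of the gradient is easily found. Suppose that for each $\lambda \in \mathbb{R}^m_+$, there is $\theta^\circ(\lambda)$ satisfying

$$ \varphi^*(\lambda) = \min_\theta\{\Gamma(\theta) + \lambda^\top g(\theta)\} = \Gamma(\theta^\circ(\lambda)) + \lambda^\top g(\theta^\circ(\lambda)) $$


Proposition 4.12. For any $\lambda^\circ \in \mathbb{R}^m_+$, a sub-gradient of the dual function is given by $\nabla\varphi^*(\lambda^\circ) = [g(\theta^\circ(\lambda^\circ))]^\top$. That is, for all $\lambda \in \mathbb{R}^m_+$,
$$ \varphi^*(\lambda) \le \varphi^*(\lambda^\circ) + [g(\theta^\circ(\lambda^\circ))]^\top(\lambda - \lambda^\circ) $$
Proof. We have by the definitions,
$$ \begin{aligned} \varphi^*(\lambda) = \min_\theta\{\Gamma(\theta) + \lambda^\top g(\theta)\} &\le \Gamma(\theta^\circ(\lambda^\circ)) + \lambda^\top g(\theta^\circ(\lambda^\circ)) \\ &= [\Gamma(\theta^\circ(\lambda^\circ)) + {\lambda^\circ}^\top g(\theta^\circ(\lambda^\circ))] + (\lambda - \lambda^\circ)^\top g(\theta^\circ(\lambda^\circ)) \\ &= \varphi^*(\lambda^\circ) + [g(\theta^\circ(\lambda^\circ))]^\top(\lambda - \lambda^\circ) \end{aligned} $$

□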


This inspires the primal-dual flow that is almost the same as (4.35). The ODE for the parameter estimate is identical:

$$ \tfrac{d}{dt}\vartheta_t = -\nabla\Gamma(\vartheta_t) - [\partial g(\vartheta_t)]^\top \lambda_t $$

The ODE for the dual variable $\lambda$ must be modified to impose non-negativity. This comes in the form of an $m$-dimensional reflection process $\gamma$. It is easiest to express the new dual dynamics in integral form

$$ \lambda_t = \lambda_0 + \int_0^t g(\vartheta_r)\, dr + \gamma_t, \tag{4.38} $$

where $\lambda_0 \ge 0$ is the initial condition. The reflection process is defined by three constraints (for each $1 \le i \le m$):

1. $\gamma_0(i) = 0$.
2. $\gamma_t(i)$ is non-decreasing, and the solution to (4.38) is non-negative ($\lambda_t(i)$ is non-negative for each $i$ and $t$).
3. It is the minimal function of time satisfying 1 and 2. This is equivalently expressed

$$ \int_0^T \lambda_t(i)\, d\gamma_t(i) = 0, \qquad T \ge 0 \tag{4.39} $$

The integral (4.39) is defined in the sense of Riemann and Stieltjes, and on combining 1-3 we see that
$$ \lambda_t(i) > 0 \ \Longrightarrow\ \tfrac{d}{dt}\gamma_t(i) = 0 $$


Prop. 4.11 extends easily to this primal-dual flow, and the proof is almost identical: the property 
(4.39) allows us to disregard the reflection process in a crucial part of the Lyapunov function 
analysis found in Section 4.8.3. 


Proposition 4.13. Suppose that $\Gamma$ is strictly convex and coercive, that $g$ is convex, and suppose that $\partial g(\theta^*)$ has rank $m$. Then the primal-dual flow converges to the unique solution $(\theta^*,\lambda^*)$ of the dual:

$$ \mathcal{L}(\theta^*,\lambda^*) = \varphi^*(\lambda^*) = \Gamma^* $$

□


Euler approximation. A remaining question is how to translate the primal-dual flow with reflection into a discrete-time algorithm, since (4.38) is no longer an ODE. A standard primal-dual algorithm is defined by the pair of recursions:

$$ \theta_{n+1} = \theta_n - \alpha_{n+1}\{\nabla\Gamma(\theta_n) + [\partial g(\theta_n)]^\top \lambda_n\} \tag{4.40a} $$
$$ \lambda_{n+1} = [\lambda_n + \alpha_{n+1} g(\theta_n)]_+ \tag{4.40b} $$

where $[\,\cdot\,]_+$ is the component-wise maximum with zero. Hence (4.40b) can be expressed

$$ \lambda_{n+1} = \lambda_n + \alpha_{n+1} g(\theta_n) + \Delta^+_n, \qquad \Delta^+_n = [\lambda_n + \alpha_{n+1} g(\theta_n)]_+ - [\lambda_n + \alpha_{n+1} g(\theta_n)] $$

so that $\Delta^+_n$ may be interpreted as an increment of a reflection process.
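
The recursions (4.40) can be sketched in a few lines. The example below is an illustration under stated assumptions: a constant step-size `alpha` is used in place of the sequence $\{\alpha_n\}$, and the objective $\tfrac12\|\theta - c\|^2$ and the affine constraint data `G`, `h` are hypothetical choices, not taken from the text. The first constraint is active at the optimum and the second is inactive, so its multiplier should be held at zero by the projection.

```python
import numpy as np

def primal_dual(grad_gamma, g, dg, theta0, lam0, alpha=0.01, n_iter=5000):
    """Projected primal-dual recursion (4.40) with a constant step-size (an assumption)."""
    theta = np.array(theta0, dtype=float)
    lam = np.array(lam0, dtype=float)
    for _ in range(n_iter):
        g_n = g(theta)
        theta_next = theta - alpha * (grad_gamma(theta) + dg(theta).T @ lam)  # (4.40a)
        lam = np.maximum(lam + alpha * g_n, 0.0)                              # (4.40b)
        theta = theta_next
    return theta, lam

# Illustrative problem: minimize 0.5*||theta - c||^2 subject to G theta + h <= 0.
c = np.array([1.0, 1.0])
G = np.array([[1.0, 1.0],     # theta_1 + theta_2 <= 1   (active at the optimum)
              [-1.0, 0.0]])   # -theta_1 <= 2            (inactive)
h = np.array([-1.0, -2.0])

theta, lam = primal_dual(
    grad_gamma=lambda th: th - c,
    g=lambda th: G @ th + h,
    dg=lambda th: G,
    theta0=np.zeros(2),
    lam0=np.zeros(2),
)
print(theta, lam)
```

For this problem the minimizer is the projection of `c` onto the half-space, $\theta^* = (0.5, 0.5)$, with multipliers $\lambda^* = (0.5, 0)$.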


4.5 Quasi-Stochastic Approximation 


We are interested in solving a root-finding problem of a special form, which requires an adjustment of notation. Given a function $f\colon \mathbb{R}^d \times \Omega \to \mathbb{R}^d$, and a random vector $\Phi$ taking values in a set $\Omega$ (assumed to be a subset of Euclidean space), we denote the average (or expectation) by

$$ \bar f(\theta) \overset{\text{def}}{=} \mathsf{E}[f(\theta,\Phi)], \qquad \theta \in \mathbb{R}^d \tag{4.41} $$

Our goal is then to solve $\bar f(\theta^*) = 0$ for this exotic function $\bar f\colon \mathbb{R}^d \to \mathbb{R}^d$. In this section we introduce algorithms to achieve this goal by adapting the ODE approaches in previous sections of this chapter, so that $\bar f$ sometimes plays the role of the vector field $f$ used to define the ODE (4.2):

$$ \tfrac{d}{dt}\vartheta_t = \bar f(\vartheta_t) \tag{4.42} $$
The big challenge is that we may know little about $f$ or $\Phi$.

If you don’t know what is meant by random, or expectations, you don’t have to worry. We 
avoid any mention of probability in this section by replacing “random variables” by sinusoids, or 
other “bouncy” functions of time. 


For those of you with a background in probability theory. The stochastic approximation (SA) method of Robbins and Monro [301] amounts to a variation of the Euler scheme (4.8), in which we replace $\bar f$ by samples from $f$:

$$ \theta_{n+1} = \theta_n + \alpha_{n+1} f(\theta_n, \Phi_{n+1}), \qquad n \ge 0, \tag{4.43} $$

where $\{\Phi_n\}$ are random vectors, whose distributions approximate those of $\Phi$ for large $n$, and $\{\alpha_n\}$ is a non-negative step-size sequence.


We say that an ODE approximation holds if $\theta_n \approx \vartheta_{t_n}$, where $\vartheta$ is the solution to (4.42), and the sampling times $\{t_n\}$ are defined as in (4.8). The assumptions required for a good approximation are not very different from what is required to successfully apply the deterministic Euler approximation (4.8).


The upshot of stochastic approximation is that it can be implemented without knowledge of the function $f$ or of the distribution of $\Phi$; rather, it can rely on observations of the sequence $\{f(\theta_n, \Phi_{n+1})\}$. This is one reason why these algorithms are valuable in the context of reinforcement learning.
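
As a concrete instance of (4.43), the sketch below estimates a mean. It assumes the simple choice $f(\theta,\Phi) = \Phi - \theta$, samples that are uniform on $[0,1]$, and the step-size $\alpha_n = 1/n$; these are illustrative assumptions only. With this step-size the recursion reproduces the running sample mean, and it uses only the observed values $f(\theta_n,\Phi_{n+1})$.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=10_000)   # observations Phi_1, Phi_2, ...

def f(theta, phi):
    """Illustrative choice: f_bar(theta) = E[Phi] - theta, with root theta* = E[Phi]."""
    return phi - theta

theta = 5.0                                    # arbitrary initial condition theta_0
for n, phi in enumerate(samples):
    alpha = 1.0 / (n + 1)                      # step-size alpha_{n+1} = 1/(n+1)
    theta = theta + alpha * f(theta, phi)      # recursion (4.43)

print(theta, samples.mean())
```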


In much of the SA and RL literature it is assumed that $\Phi$ is a Markov chain: a topic considered in depth in Chapter 6. A motivating observation in the present chapter is that Markov chains need not be stochastic: the deterministic state space model (2.21) (without control) always satisfies the Markov property used in Part 2 of this book. For example, for given $\omega > 0$, the sequence $\Phi_n = [\cos(\omega n), \sin(\omega n)]$ is a Markov chain on $\Omega$ (the unit circle in $\mathbb{R}^2$). This motivates the quasi-stochastic approximation (QSA) ODE:

$$ \tfrac{d}{dt}\Theta_t = a_t f(\Theta_t, \xi_t) \tag{4.44} $$

We use the terms gain and step-size interchangeably for the non-negative process $a$, and $\xi$ is called the probing signal.
The expectation in (4.41) is defined by the sample path average:

$$ \bar f(\theta) = \lim_{T\to\infty} \frac{1}{T}\int_0^T f(\theta,\xi_t)\, dt, \qquad \text{for all } \theta \in \mathbb{R}^d. \tag{4.45} $$

Of course, the existence of this limit requires assumptions on $f$ and the probing signal.
The probing signal is deterministic in the QSA theory developed in this chapter. Two canonical choices with $\Omega \subset \mathbb{R}^m$ are the $m$-dimensional mixtures of periodic functions:

$$ \xi_t = \sum_{i=1}^K v^i\,[\phi_i + \omega_i t] \ (\mathrm{mod}\ 1) \tag{4.46a} $$
$$ \xi_t = \sum_{i=1}^K v^i \sin(2\pi[\phi_i + \omega_i t]) \tag{4.46b} $$
for fixed $K \ge 1$, vectors $\{v^i\} \subset \mathbb{R}^m$, phases $\{\phi_i\}$, and frequencies $\{\omega_i\}$. Such signals have well defined steady-state means and covariance matrices. Consider for example (4.46b) in the special case

$$ \xi_t(i) = \sqrt{2}\,\sin(\omega_i t), \qquad 1 \le i \le m \tag{4.47} $$

with $\omega_i \ne \omega_j$ for all $i \ne j$. The steady state mean and covariance then satisfy

$$ \lim_{T\to\infty} \frac{1}{T}\int_0^T \xi_t\, dt = 0 \tag{4.48a} $$
$$ \lim_{T\to\infty} \frac{1}{T}\int_0^T \xi_t \xi_t^\top\, dt = I \tag{4.48b} $$

where $I$ is the identity matrix.
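
The limits (4.48) are easy to check numerically for the signal (4.47). The sketch below is only an illustration: the three frequencies, the horizon `T`, and the sampling interval `dt` are arbitrary choices, and the integrals are approximated by Riemann sums.

```python
import numpy as np

omega = np.array([1.0, 2.0, 3.5])         # distinct frequencies (illustrative choice)
T, dt = 2000.0, 0.01
t = np.arange(0.0, T, dt)

# Probing signal (4.47): xi_t(i) = sqrt(2) * sin(omega_i * t); one row per time sample.
xi = np.sqrt(2.0) * np.sin(np.outer(t, omega))

mean = xi.mean(axis=0)                    # approximates (1/T) * integral of xi_t
cov = xi.T @ xi / len(t)                  # approximates (1/T) * integral of xi_t xi_t'

print(np.round(mean, 4))
print(np.round(cov, 4))
```

The deviations from zero mean and identity covariance are of order $1/T$, which is why a long horizon is used.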
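For a function $g\colon \mathbb{R}^K \to \mathbb{R}$ we can expect the following asymptotic independence, provided the frequencies $\{\omega_i\}$ are distinct:

$$ \begin{aligned} \lim_{T\to\infty} \frac{1}{T}\int_0^T & g\bigl(\sin(2\pi[\phi_1 + \omega_1 t]), \dots, \sin(2\pi[\phi_K + \omega_K t])\bigr)\, dt \\ &= \int_0^1\!\!\cdots\!\int_0^1 g\bigl(\sin(2\pi[\phi_1 + t_1]), \dots, \sin(2\pi[\phi_K + t_K])\bigr)\, dt_1 \cdots dt_K \end{aligned} \tag{4.49} $$
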


See Lemma 4.37 for a precise statement based on another general class of probing signals. 
The following subsections contain examples to illustrate theory of QSA, and also a glimpse at 
applications. 


4.5.1 Quasi Monte-Carlo 


Consider the problem of obtaining the integral over the interval $[0,1]$ of a function $y\colon \mathbb{R} \to \mathbb{R}$. In a standard Monte-Carlo approach we would draw independent random variables $\{\Phi(k)\}$, with distribution uniform on the interval $[0,1]$, and then average:

$$ \theta_n = \frac{1}{n}\sum_{k=0}^{n-1} y(\Phi(k)) \tag{4.50} $$

In one QSA approach, the probing signal is the one-dimensional sawtooth function, $\xi_t = t$ (modulo 1), and estimates are defined by the average

$$ \Theta_t = \frac{1}{t}\int_0^t y(\xi_r)\, dr \tag{4.51} $$
Alternatively, we can adapt the QSA model (4.44) to this example, with
$$ f(\theta,\xi) = y(\xi) - \theta. \tag{4.52} $$
The mean vector field is given by
$$ \bar f(\theta) = \lim_{T\to\infty} \frac{1}{T}\int_0^T f(\theta,\xi_t)\, dt = \int_0^1 y(\xi_t)\, dt - \theta $$
so that $\theta^* = \int_0^1 y(\xi_t)\, dt$ is the unique root of $\bar f$. The QSA ODE (4.44) gives

$$ \tfrac{d}{dt}\Theta_t = a_t[y(\xi_t) - \Theta_t]. \tag{4.53} $$

Figure 4.3: Sample paths of Quasi Monte-Carlo estimates.


The Monte-Carlo approach (4.51) can be transformed into something resembling (4.53). Taking derivatives of each side of (4.51), we obtain using the product rule of differentiation, and the fundamental theorem of calculus,

$$ \tfrac{d}{dt}\Theta_t = -\frac{1}{t^2}\int_0^t y(\xi_r)\, dr + \frac{1}{t}\, y(\xi_t) = \frac{1}{t}\,[y(\xi_t) - \Theta_t] $$

This is precisely (4.53) with $a_t = 1/t$ (not a great choice for an ODE design, since it is not bounded as $t \downarrow 0$).

The numerical results that follow are based on y(@) = e* sin(1000), whose mean is 6*  —0.5. 
The differential equation (4.53) was approximated using a standard Euler scheme with sampling interval $10^{-3}$. Several variations were simulated, distinguished by the gain $a_t = g/(1+t)$. Fig. 4.3 shows typical sample paths of the resulting estimates for a range of gains, and common initialization $\Theta_0 = 10$. In each case, the estimates converge to the true mean $\theta^* \approx -0.5$, but convergence is very slow for $g > 0$ significantly less than one. Recall that the case $g = 1$ is very similar to what was obtained from the Monte-Carlo approach (4.51).
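
An experiment of this kind can be sketched as follows. The Euler scheme, the sampling interval, the gain $a_t = g/(1+t)$, and the initialization $\Theta_0 = 10$ follow the description above; the integrand `y` is an assumption chosen for illustration and is not claimed to be the function used to produce the figures.

```python
import math

def y(r):
    """Illustrative integrand on [0, 1] (an assumption, not taken from the text)."""
    return math.exp(4.0 * r) * math.sin(100.0 * r)

def qsa_estimate(g, T=100.0, dt=1e-3, theta0=10.0):
    """Euler approximation of the QSA ODE (4.53) with gain a_t = g / (1 + t)."""
    steps_per_period = int(round(1.0 / dt))
    theta = theta0
    for k in range(int(round(T / dt))):
        t = k * dt
        xi = (k % steps_per_period) * dt   # sawtooth probing signal: t modulo 1
        a = g / (1.0 + t)
        theta += dt * a * (y(xi) - theta)
    return theta

# Closed-form value of the integral of exp(4r) sin(100r) over [0, 1], for comparison.
exact = (math.exp(4.0) * (4.0 * math.sin(100.0) - 100.0 * math.cos(100.0)) + 100.0) \
        / (4.0 ** 2 + 100.0 ** 2)

est = qsa_estimate(g=2.0)
print(est, exact)
```

Part of the remaining error comes from the Euler discretization of the sawtooth rather than from the QSA ODE itself, so a smaller `dt` tightens the agreement.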

Independent trials were conducted to obtain variance estimates. In each of $10^4$ independent runs, the common initial condition was drawn randomly,¹¹ and the estimate was collected at time $T = 100$. Fig. 4.4 shows three histograms of estimates for standard Monte-Carlo (4.50), and QSA
using gains g = 1 and 2. An alert reader must wonder: why is the variance reduced by 4 orders of 
magnitude when the gain is increased from 1 to 2? The relative success of the high-gain algorithm 
is explained in Section 4.5. 


[Figure 4.4 panels: Monte Carlo ($\mu = -0.47$, $\sigma^2 \approx 2\times 10^{-3}$); QSA with $g = 1$ ($\mu = -0.48$, $\sigma^2 \approx 1\times 10^{-3}$); QSA with $g = 2$ ($\mu = -0.48$, $\sigma^2 \approx 1\times 10^{-7}$).]

Figure 4.4: Histograms of Monte-Carlo and Quasi Monte-Carlo estimates after $10^4$ independent runs. The optimal parameter is $\theta^* \approx -0.4841$.


¹¹ $N(0,10)$ (Gaussian with zero mean and variance 10)

Buyer beware. The remainder of this chapter is based on extensions of quasi Monte-Carlo, which is traditionally framed in discrete time. The sawtooth function used in (4.51) is a common choice in this research area, defined more generally in discrete time as follows:

$$ \xi(k) = \xi(0) + \omega k \ (\mathrm{mod}\ 1) \tag{4.54} $$


Subject to conditions on the parameter $\omega$ and function $y\colon \mathbb{R} \to \mathbb{R}$, we have the Law of Large Numbers:

$$ \lim_{n\to\infty} \frac{1}{n}\sum_{k=0}^{n-1} y(\xi(k)) = \int_0^1 y(r)\, dr $$

This is known as the Equidistribution Theorem (see [33], and [149, p. 87] for more history). The quasi Monte-Carlo literature contains more sophisticated techniques to define well behaved “probing sequences”.

Sinusoids and sawtooth functions are used in this chapter for simplicity (and because of my own 
ignorance of the substantial literature on pseudo randomness). An expert on quasi Monte-Carlo 
methods might suggest a two-step process in translating a QSA ODE to discrete time: 


1. A discrete-time approximation of the QSA ODE. 
2. Careful selection of the probing sequence in discrete time. 


I am hopeful that step 2 can be avoided, as long as we are careful with step 1. We are not bound 
to Euler here: remember that Matlab’s ode45 is based on more efficient numerical methods. 


4.5.2 System identification 


The next example illustrating QSA techniques concerns system identification. Consider the nonlinear state space model in continuous time:

$$ \begin{aligned} \tfrac{d}{dt}x_t &= f(x_t,u_t) + d_t, \qquad x_0 \text{ given} \\ y_t &= g(x_t,u_t) + w_t \end{aligned} \tag{4.55} $$

with state $x_t \in \mathbb{R}^n$, input $u_t \in \mathbb{R}^m$, and the output is taken scalar for simplicity: $y_t \in \mathbb{R}$. The signals $\{d_t\}$ and $\{w_t\}$ are known respectively as the disturbance and measurement noise.

The functions $f$ and $g$ that define the dynamics are not known. We are given observations of the input and output (and possibly also the state), and wish to find a model that fits these measurements. One approach is to propose a parameterized family of models:

$$ \tfrac{d}{dt}x^\theta_t = f(x^\theta_t, u_t; \theta), \qquad x^\theta_0 = x_0 $$

The goal is to estimate $\theta^* \in \mathbb{R}^d$ based on input-output measurements, where this “best parameter” corresponds to a model that best reflects input-output observations.
In the prediction error method we introduce a loss function $\Gamma\colon \mathbb{R}^d \to \mathbb{R}$, defined by

$$ \Gamma(\theta) = \frac{1}{T}\int_0^T \ell(y_t - y^\theta_t)\, dt $$

in which $y_t$ is observed at time $t$, and $y^\theta_t$ is obtained from the model with identical input $u_t$, and initial state $x^\theta_0 = x_0$. The function $\ell\colon \mathbb{R} \to \mathbb{R}$ satisfies $\ell(0) = 0$ and $\ell(z) > 0$ for $z \ne 0$.
Consider the typical quadratic loss, $\ell(z) = z^2$. We might hope to apply gradient descent to find the minimizer $\theta^*$ of $\Gamma$, where the gradient can be expressed

$$ \nabla_\theta\Gamma(\theta) = \frac{1}{T}\int_0^T \nabla_\theta\,\ell(y_t - y^\theta_t)\, dt = -\frac{2}{T}\int_0^T (y_t - y^\theta_t)\,\nabla_\theta y^\theta_t\, dt $$

We are then faced with finding a model for the gradient of the observations.
We might design the model so that this is an easy task:

$$ y^\theta_t = \theta^\top \phi_t $$

where the regression vector $\phi_t \in \mathbb{R}^d$ is a function of observations and not the model. Hence the gradient is expressed in terms of observables, $\nabla_\theta y^\theta_t = \phi_t$, and the gradient of the loss function is linear: $\tfrac{1}{2}\nabla_\theta\Gamma(\theta) = M\theta - b$, with

$$ M = \frac{1}{T}\int_0^T \phi_t\phi_t^\top\, dt \qquad \text{and} \qquad b = \frac{1}{T}\int_0^T y_t\,\phi_t\, dt $$


These representations are one motivation for ARMA models (see Section 2.2 for a definition in 
discrete time). 
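
With a linear-in-parameters model the minimizer is obtained by solving $M\theta = b$. The sketch below illustrates this with synthetic data: the regression vector `phi`, the parameter `theta_true`, and the noise-free observations are hypothetical choices made only so that the answer can be checked, and the time averages are approximated by Riemann sums.

```python
import numpy as np

dt, T = 0.01, 50.0
t = np.arange(0.0, T, dt)

# Hypothetical regression vector phi_t in R^3 and noise-free observations y_t = theta' phi_t.
phi = np.stack([np.sin(t), np.cos(2.0 * t), np.ones_like(t)], axis=1)
theta_true = np.array([1.5, -0.7, 0.2])
y = phi @ theta_true

M = phi.T @ phi / len(t)          # approximates (1/T) * integral of phi_t phi_t'
b = phi.T @ y / len(t)            # approximates (1/T) * integral of y_t phi_t

theta_hat = np.linalg.solve(M, b) # root of the half-gradient M theta - b
print(theta_hat)
```

Disturbances and measurement noise are omitted here; with noisy observations the same computation returns the least-squares fit rather than an exact match.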

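In the absence of such a simple description, we look more closely at the state space model. Assume that $x^\theta_t$ and $y^\theta_t$ are each continuously differentiable in $(t,\theta)$. Let $S^\theta_t = \partial_\theta x^\theta_t$ denote the $n \times d$ matrix of partial derivatives of the state, with entries $[S^\theta_t]_{i,j} = \partial x^\theta_t(i)/\partial\theta_j$.

Recalling the calculus convention $\nabla_\theta y^\theta_t = [\partial_\theta y^\theta_t]^\top$, we obtain by the chain rule,

$$ \begin{aligned} \tfrac{d}{dt}S^\theta_t &= f_x(x^\theta_t, u_t; \theta)\,S^\theta_t + f_\theta(x^\theta_t, u_t; \theta) \\ \partial_\theta y^\theta_t &= g_x(x^\theta_t, u_t; \theta)\,S^\theta_t + g_\theta(x^\theta_t, u_t; \theta) \end{aligned} $$

where each subscript on the right hand side represents a partial derivative. For example, $f_x(x,u;\theta)$ is the $n \times n$ matrix with entries

$$ [f_x(x,u;\theta)]_{i,j} = \frac{\partial}{\partial x_j} f_i(x,u;\theta) $$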


There are two challenges with this approach: one is potential numerical instability of the differential equation generating $\{S^\theta_t\}$. Another is the complexity of this ODE, especially when the dimension $n$ is large: do we really have to generate $\{S^\theta_t : 0 \le t \le T\}$ in order to obtain $\nabla_\theta\Gamma(\theta)$ for a single value of $\theta$? This is a massive burden if we require many iterations of steepest descent. This is ample motivation for the gradient-free approaches to optimization surveyed in Section 4.6.


4.5.3 Approximate policy improvement

Consider again a nonlinear state space model in continuous time,
$$ \tfrac{d}{dt}x_t = f(x_t, u_t), \qquad t \ge 0 $$

with $x_t \in \mathbb{R}^n$, $u_t \in \mathbb{R}^m$. Given a cost function $c\colon \mathbb{R}^{n+m} \to \mathbb{R}$, our goal is to approximate the optimal value function

$$ J^*(x) = \min_{u} \int_0^\infty c(x_t, u_t)\, dt, \qquad x = x_0 $$

and approximate the optimal policy. For this we first explain how policy improvement extends to the continuous time setting.
For any feedback law $u_t = \phi(x_t)$, denote the associated value function by

$$ J^\phi(x) = \int_0^\infty c(x_t, \phi(x_t))\, dt, \qquad x = x_0 $$
It follows from Prop. 2.7 that this solves a dynamic programming equation:
$$ 0 = c(x, \phi(x)) + \nabla J^\phi(x) \cdot f(x, \phi(x)) $$
The policy improvement step in this continuous time setting defines the new policy as the minimizer:

$$ \phi^+(x) \in \arg\min_u \{c(x,u) + \nabla J^\phi(x) \cdot f(x,u)\} $$

Consequently, approximating the term in brackets is key to approximating PIA.
An RL algorithm is constructed through the following steps (which were first proposed in Section 3.7). First, add $J^\phi$ to each side of the fixed-policy dynamic programming equation:

$$ J^\phi(x) = J^\phi(x) + c(x, \phi(x)) + \nabla J^\phi(x) \cdot f(x, \phi(x)) $$
The right-hand side motivates the following definition of the fixed-policy Q-function:
$$ Q^\phi(x,u) = J^\phi(x) + c(x,u) + f(x,u) \cdot \nabla J^\phi(x). $$

The policy update can be equivalently expressed $\phi^+(x) \in \arg\min_u Q^\phi(x,u)$, and this Q-function solves the fixed point equation

$$ Q^\phi(x,u) = \underline{Q}^\phi(x) + c(x,u) + f(x,u) \cdot \nabla\underline{Q}^\phi(x) \tag{4.56} $$
where $\underline{H}(x) = H(x, \phi(x))$ for any function $H$ (note that this is a substitution, rather than the minimization appearing in (3.7d)).

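Consider now a family of functions for approximation $\{Q^\theta : \theta \in \mathbb{R}^d\}$, and consider the Bellman error:
$$ B^\theta(x,u) = -Q^\theta(x,u) + \underline{Q}^\theta(x) + c(x,u) + f(x,u) \cdot \nabla\underline{Q}^\theta(x) \tag{4.57} $$

A model-free representation is obtained, on recognizing that the chain rule gives $\tfrac{d}{dt}\underline{Q}^\theta(x_t) = f(x_t,u_t) \cdot \nabla\underline{Q}^\theta(x_t)$, so that for any state-input pair $(x_t, u_t)$,

$$ B^\theta(x_t, u_t) = -Q^\theta(x_t, u_t) + \underline{Q}^\theta(x_t) + c(x_t, u_t) + \tfrac{d}{dt}\underline{Q}^\theta(x_t) \tag{4.58} $$
The error $B^\theta(x_t, u_t)$ can be observed without knowledge of the dynamics $f$ or even the cost function $c$. The goal is to find $\theta^*$ that minimizes the mean square error:

$$ \|B^\theta\|^2 \overset{\text{def}}{=} \lim_{T\to\infty} \frac{1}{T}\int_0^T [B^\theta(x_t, u_t)]^2\, dt \tag{4.59} $$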


We choose a feedback law with “exploration”, of the form introduced in Section 2.5.3:
$$ u_t = \tilde\phi(x_t, \xi_t) \tag{4.60} $$

chosen so that the resulting state trajectories are bounded for each initial condition, and the joint process $(x, u, \xi)$ admits an “ergodic steady state” (meaning that the existence of sample path averages such as (4.59) is guaranteed).

This approximation technique defines an approximate version of PIA: given a policy $\phi$ and approximation $Q$, the policy is updated:

$$ \phi^+(x) = \arg\min_u Q(x,u) \tag{4.61} $$

This procedure is repeated to obtain a recursive algorithm.


Least squares solution Consider the loss function 


Suppose that the function approximation architecture is linear (3.43), so that $\Gamma_T$ is a quadratic function of $\theta$:

$$ \Gamma_T(\theta) = \theta^\top M_T\,\theta - 2 b_T^\top\theta + \Gamma_T(0) = (\theta - \theta^*)^\top M_T\,(\theta - \theta^*) + \Gamma_T(\theta^*) $$

We leave it to the reader to find expressions for $M_T$, $b_T$, and $\Gamma_T(0)$.

In this special case we do not need gradient descent techniques: the matrix $M_T$ and vector $b_T$ are obtained as sample-path averages (the Monte-Carlo approach surveyed in Section 4.5.1), and then $\theta^* = M_T^{-1} b_T$ is the unique minimizer of $\Gamma_T$, provided $M_T$ is positive definite.


Figure 4.5: Comparison of QSA and Stochastic Approximation (SA) for policy evaluation.


Gradient descent. Without a linear parameterization we turn to gradient descent to minimize $\Gamma$, with gradient

$$ \nabla\Gamma(\theta) = \lim_{T\to\infty} \frac{1}{T}\int_0^T [B^\theta(x_t, u_t)]\,\nabla_\theta B^\theta(x_t, u_t)\, dt $$
The first-order condition for optimality is expressed as the root-finding problem $\nabla\Gamma(\theta) = 0$, and the standard gradient descent algorithm in ODE form is

$$ \tfrac{d}{dt}\vartheta_t = -\nabla_\theta\Gamma(\vartheta_t) $$
Its QSA counterpart (4.44) is
$$ \tfrac{d}{dt}\Theta_t = -a_t\, B^{\Theta_t}(x_t, u_t)\,\zeta^{\Theta_t}_t \tag{4.62a} $$
$$ \zeta^\theta_t \overset{\text{def}}{=} \nabla_\theta B^\theta(x_t, u_t) = -\nabla_\theta Q^\theta(x_t, u_t) + \bigl\{\nabla_\theta Q^\theta(x_t, \phi(x_t)) + \tfrac{d}{dt}\nabla_\theta Q^\theta(x_t, \phi(x_t))\bigr\} \tag{4.62b} $$

where (4.62b) follows from (4.58), provided we can justify the exchange of differentiation with respect to time and with respect to $\theta$.

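The QSA gradient descent algorithm (4.62) is best motivated by a nonlinear function approximation, but it is instructive to see how the ODE simplifies for the linearly parameterized family (3.43). We have in this case

$$ \zeta_t = -\psi(x_t, u_t) + \psi(x_t, \phi(x_t)) + \tfrac{d}{dt}\psi(x_t, \phi(x_t)) $$
and $B^\theta(x_t, u_t) = c_t + \zeta_t^\top\theta$ using $c_t = c(x_t, u_t)$, so that (4.62a) becomes
$$ \tfrac{d}{dt}\Theta_t = -a_t[\zeta_t\zeta_t^\top\Theta_t + c_t\zeta_t] \tag{4.63} $$
The convergence of (4.63) may be very slow if the matrix
$$ R^\zeta = \lim_{t\to\infty} \frac{1}{t}\int_0^t \zeta_\tau\zeta_\tau^\top\, d\tau \tag{4.64} $$

has eigenvalues close to zero. This can be resolved through the introduction of a larger gain $a$, or a matrix gain. One approach is to estimate $R^\zeta$ from data and invert: $G_t = [\widehat{R}^\zeta_t]^{-1}$ with

$$ \tfrac{d}{dt}\widehat{R}^\zeta_t = \frac{1}{t+1}\bigl[\zeta_t\zeta_t^\top - \widehat{R}^\zeta_t\bigr], \qquad \widehat{R}^\zeta_0 > 0 \tag{4.65a} $$
$$ \tfrac{d}{dt}\Theta_t = -a_t G_t[\zeta_t\zeta_t^\top\Theta_t + c_t\zeta_t], \qquad t \ge 0 \tag{4.65b} $$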


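Numerical example. Consider the LQR problem in which $\tfrac{d}{dt}x = Ax + Bu$, and $c(x,u) = x^\top S x + u^\top R u$, with $S \ge 0$ and $R > 0$. The fixed-policy Q-function associated with any stable linear policy $\phi(x) = -Kx$ takes the form

$$ Q^\phi(x,u) = \begin{bmatrix} x \\ u \end{bmatrix}^\top \left( \begin{bmatrix} S & 0 \\ 0 & R \end{bmatrix} + \begin{bmatrix} A^\top M + MA + M & MB \\ B^\top M & 0 \end{bmatrix} \right) \begin{bmatrix} x \\ u \end{bmatrix} $$

where $J^\phi(x) = x^\top M x$, and $M$ solves the Lyapunov equation (2.46) for the closed-loop matrix $A_K = A - BK$, with $S$ replaced by $K^\top R K + S$:
$$ A_K^\top M + M A_K + K^\top R K + S = 0 $$
This motivates a quadratic basis, which for the special case $n = 2$ and $m = 1$ becomes
$$ \psi(x,u) = (x_1^2,\ x_2^2,\ x_1x_2,\ x_1u,\ x_2u,\ u^2)^\top $$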
In order to implement the algorithm (4.65b) we begin with selecting an input of the form
$$ u_t = -K_e x_t + \xi_t \tag{4.66} $$

where $K_e$ is a stabilizing feedback gain (which need not be the same $K$ whose value function we wish to approximate).
The numerical results that follow are based on a double integrator with friction:

$$ \ddot{y} = -0.1\,\dot{y} + u $$

which can be expressed in state space form using $x = (y, \dot{y})^\top$:

$$ \tfrac{d}{dt}x = \begin{bmatrix} 0 & 1 \\ 0 & -0.1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u \tag{4.67} $$

Figure 4.6: Iterations of PIA

A relatively large cost was imposed on the input: $S = I$ and $R = 10$.

Fig. 4.5 shows the evolution of the QSA ODE (4.65) for the evaluation of the policy with gain $K = [1, 0]$, in which the input (4.66) used $K_e = [1, 2]$ and $\xi$ the sum of 24 sinusoids with random phase shifts and whose frequency was sampled uniformly between 0 and 50 rad/s. The gain was $a_t = 1/(1+t)$. The QSA ODE is compared with the related SA algorithm in which $\xi$ is “white noise” instead of a deterministic signal.¹²

The gain $K$ was chosen as an initialization in approximate policy iteration: with $K_0 = K$ we obtain an approximation $Q^{\theta^0}$ for the associated fixed policy Q-function for this linear policy, and then $\phi^1(x) = -K_1 x$ is obtained via the policy improvement step (4.61) with $Q = Q^{\theta^0}$. These steps are repeated to generate a sequence of parameter estimates $\{\theta^n\}$ and feedback gains $\{K_n\}$. Fig. 4.6 shows the weighted error for the feedback gains, where the optimal gain $K^*$ is obtained from the ARE derived in Section 3.9.4. The PIA algorithm indeed converges to the optimal control gain $K^*$.
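
For comparison with the approximate algorithm, exact policy iteration for this LQR example can be sketched when the model is known. This is a model-based counterpart and not the QSA algorithm (4.65): it assumes access to $A$ and $B$, evaluates each policy by solving the closed-loop Lyapunov equation, and applies the improvement step $K^+ = R^{-1}B^\top M$ implied by the quadratic form of the Q-function. The helper `lyapunov` and the number of iterations are choices made for this illustration.

```python
import numpy as np

def lyapunov(Acl, Qc):
    """Solve Acl' M + M Acl + Qc = 0 via the vectorized (Kronecker) linear system."""
    n = Acl.shape[0]
    I = np.eye(n)
    lhs = np.kron(Acl.T, I) + np.kron(I, Acl.T)
    return np.linalg.solve(lhs, -Qc.reshape(-1)).reshape(n, n)

# Double integrator with friction (4.67), with S = I and R = 10.
A = np.array([[0.0, 1.0], [0.0, -0.1]])
B = np.array([[0.0], [1.0]])
S = np.eye(2)
R = np.array([[10.0]])

K = np.array([[1.0, 0.0]])               # initial stabilizing gain K_0
for _ in range(20):
    Acl = A - B @ K                      # closed-loop dynamics under phi(x) = -K x
    M = lyapunov(Acl, S + K.T @ R @ K)   # policy evaluation: J(x) = x' M x
    K = np.linalg.solve(R, B.T @ M)      # policy improvement: minimize Q over u

# At convergence M should satisfy the algebraic Riccati equation.
are_residual = A.T @ M + M @ A + S - M @ B @ np.linalg.solve(R, B.T @ M)
print(K, np.abs(are_residual).max())
```

The gains produced by the approximate algorithm in the text are expected to track this sequence only to the extent that each Q-function approximation is accurate.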


4.5.4 A brief tour of QSA theory 


While QSA theory is far simpler than stability of its stochastic ancestor, the technicalities are best 
left to the end of the chapter—see Section 4.9 for details. Contained here is an overview, and some 
guidelines for algorithm design. 

Our interest is not just in convergence of QSA, but convergence rates, and intuition regarding the choice of algorithm parameters. We say that the rate of convergence is $1/t^{\varrho_0}$ if

$$ \limsup_{t\to\infty}\ t^{\varrho}\,\|\tilde\Theta_t\| = \begin{cases} \infty & \varrho > \varrho_0 \\ 0 & \varrho < \varrho_0 \end{cases} \tag{4.68} $$

where $\tilde\Theta_t \overset{\text{def}}{=} \Theta_t - \theta^*$ is the estimation error. By careful design we can achieve $\varrho_0 = 1$, which is optimal in most cases. Exercise 4.18 shows that convergence may be much faster if the probing signal acts purely multiplicatively, rather than additively as in the Monte-Carlo example.


QSA-ODE solidarity. The apparent noise plays a crucial role in the analysis:
$$ \widetilde{\Xi}_t = f(\Theta_t, \xi_t) - \bar f(\Theta_t) \tag{4.69} $$

so that
$$ \tfrac{d}{dt}\Theta_t = a_t[\bar f(\Theta_t) + \widetilde{\Xi}_t] \tag{4.70} $$

While this is similar to the ODE (4.42), an apparent discrepancy is that the gain $a$ is absent from (4.42). In the continuous time theory it is simplest to introduce a gain for the purposes of comparison:

$$ \tfrac{d}{dt}\bar\Theta_t = a_t \bar f(\bar\Theta_t), \qquad t \ge t_0, \qquad \bar\Theta_{t_0} = \Theta_{t_0} \tag{4.71} $$

¹² For implementation, both (4.65) and the linear system (4.67) were approximated using Euler’s method, with time-step of 0.01s.


where the choice of $t_0$ depends on the stability properties of the associated ODE (4.42) with constant gain.

We are left with two steps:
Step 1: Understand the relationship between the solution to the original ODE (4.42), and the solution to (4.71).
Step 2: Obtain bounds on the error between solutions to the QSA ODE (4.70) and the ODE (4.71) in which the apparent noise is removed. In this step we consider the scaled error:

$$ Z_t = \frac{1}{a_t}(\Theta_t - \bar\Theta_t), \qquad t \ge t_0 \tag{4.72} $$

It is shown that this is a bounded function of time under mild assumptions.


Most of Section 4.9 is devoted to Step 2. Step 1 is addressed easily through the following change of variables:

$$ \tau = s_t \overset{\text{def}}{=} \int_0^t a_r\, dr, \qquad t \ge t_0 \tag{4.73} $$

Lemma 4.14. Let $\{\vartheta_\tau : \tau \ge \tau_0\}$ denote the solution to (4.42) initialized at time $\tau_0 = s_{t_0}$, with $\vartheta_{\tau_0} = \Theta_{t_0}$. The solution to (4.71) is then given by

$$ \bar\Theta_t = \vartheta_\tau, \qquad \tau = s_t,\ \ t \ge t_0 $$

□


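Gain selection. Consider the standard choice of gain
$$ a_t = g/(1+t)^\rho \tag{4.74} $$

in which $g > 0$ and $0 < \rho \le 1$ are fixed. The time-scaling reveals a significant difference between $\rho < 1$ and $\rho = 1$:
$$ \tau = s_t = \begin{cases} g\log(1+t) & \rho = 1 \\[4pt] g\,\dfrac{(1+t)^{1-\rho} - 1}{1-\rho} & \rho < 1 \end{cases} \tag{4.75} $$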
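
The change of variables in Lemma 4.14 and the formula (4.75) can be checked on a scalar example. The sketch below assumes the mean vector field $\bar f(\theta) = -(\theta - \theta^*)$ with $\theta^* = 1$ and $t_0 = 0$ (illustrative choices), for which the solution of (4.42) is available in closed form; it compares an Euler approximation of (4.71) with the time-scaled solution $\vartheta_\tau$, $\tau = s_t$.

```python
import math

def s(t, g, rho):
    """Time scaling (4.75): integral of g / (1 + r)^rho over [0, t]."""
    if rho == 1.0:
        return g * math.log(1.0 + t)
    return g * ((1.0 + t) ** (1.0 - rho) - 1.0) / (1.0 - rho)

def scaled_ode_euler(g, rho, T=10.0, dt=1e-4, theta0=10.0, theta_star=1.0):
    """Euler approximation of (4.71) with f_bar(theta) = -(theta - theta_star)."""
    theta = theta0
    for k in range(int(round(T / dt))):
        a = g / (1.0 + k * dt) ** rho
        theta += dt * a * (-(theta - theta_star))
    return theta

def ode_solution(tau, theta0=10.0, theta_star=1.0):
    """Closed-form solution of (4.42) for the same vector field, at time tau."""
    return theta_star + (theta0 - theta_star) * math.exp(-tau)

results = {}
for g, rho in [(2.0, 1.0), (2.0, 0.7)]:
    results[(g, rho)] = (scaled_ode_euler(g, rho), ode_solution(s(10.0, g, rho)))
print(results)
```

For $\rho = 1$ the error decays only polynomially, as $(1+t)^{-g}$ in this example, while for $\rho < 1$ the decay is much faster.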
It is here that we arrive at an apparent conflict in choice of gain. To make this clear, suppose that the ODE (4.42) satisfies the following form of exponential asymptotic stability: there exists $\varrho_0 > 0$, $B_0 < \infty$ such that for any solution to (4.42), and any $\tau \ge 0$,

$$ \|\vartheta_\tau - \theta^*\| \le B_0\,\|\vartheta_0 - \theta^*\|\,\exp(-\varrho_0\tau) \tag{4.76} $$

From the identity $\|\bar\Theta_t - \theta^*\| = \|\vartheta_\tau - \theta^*\|$, with $\tau = s_t$, we come to very different conclusions, depending on $\rho$:
$\rho < 1$: $\{\bar\Theta_t\}$ converges to $\theta^*$ very quickly. However, in this case boundedness of $\{Z_t\}$ implies a sub-optimal rate:

$$ \|\Theta_t - \bar\Theta_t\| \le B_Z\,\frac{1}{(1+t)^\rho} \tag{4.77} $$

where $B_Z$ is a function of the initial condition $\Theta_0$.
$\rho = 1$: The bound (4.77) is ideal, but the rate of convergence of $\{\bar\Theta_t\}$ may not be: an application of (4.75) gives

$$ \|\bar\Theta_t - \theta^*\| \le B_0\,\|\bar\Theta_0 - \theta^*\|\,\frac{1}{(1+t)^{g\varrho_0}} $$
To obtain the optimal $1/t$ convergence rate for QSA requires $g > 1/\varrho_0$. This high gain may lead to other problems, such as large transients.

It is here that the averaging technique of Polyak, Juditsky and Ruppert (PJR averaging) comes to the rescue. We use $\rho < 1$ to exploit the fast convergence of $\{\bar\Theta_t\}$ to $\theta^*$, and then reduce volatility by simply averaging some fraction of the estimates:

$$ \Theta^{\mathrm{PR}}_T \overset{\text{def}}{=} \frac{1}{T - T_0}\int_{T_0}^T \Theta_t\, dt \tag{4.78} $$

For example, $T_0 = T - T/5$ means that we average the final 20%. This approach will achieve the optimal $1/T$ convergence rate under very mild assumptions. Section 4.5.5 provides a gentle introduction to the theory for a special case.¹³
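
The averaging step (4.78) is simple to implement alongside an Euler approximation of the QSA ODE. The following sketch uses a scalar example with $f(\theta,\xi) = -(\theta - \theta^*) + \xi$, probing signal $\xi_t = \cos(t)$, $\theta^* = 1$, and gain $a_t = g/(1+t)^\rho$ with $\rho = 0.7$; all of these are assumptions made for illustration. The final 20% of the estimates are averaged.

```python
import math

def qsa_pr_average(g=1.0, rho=0.7, T=200.0, dt=1e-3,
                   theta0=10.0, theta_star=1.0, frac=0.2):
    """Euler approximation of the QSA ODE (4.44) for a scalar linear example,
    returning the final estimate and the average (4.78) over the last `frac` of [0, T]."""
    n = int(round(T / dt))
    n0 = int(round((1.0 - frac) * n))      # index corresponding to T_0 = T - frac*T
    theta, acc = theta0, 0.0
    for k in range(n):
        t = k * dt
        if k >= n0:
            acc += theta                   # Riemann sum for the integral in (4.78)
        a = g / (1.0 + t) ** rho
        theta += dt * a * (-(theta - theta_star) + math.cos(t))
    return theta, acc / (n - n0)

final, averaged = qsa_pr_average()
print(final - 1.0, averaged - 1.0)
```

In this example the un-averaged estimate oscillates around $\theta^*$ with amplitude of order $a_T$, and the average removes most of that oscillation.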


Basic assumptions and conclusions. It is simplest to adopt a “Markovian” setting in which the probing signal is itself the state process for a dynamical system:

$$ \tfrac{d}{dt}\xi_t = H(\xi_t) \tag{4.79} $$

where $H\colon \Omega \to \Omega$ is continuous, with $\Omega$ a bounded subset of Euclidean space. A canonical choice is the $K$-dimensional torus: $\Omega = \{x \in \mathbb{C}^K : |x_i| = 1,\ 1 \le i \le K\}$, and $\xi$ defined to allow modeling of excitation as a mixture of sinusoids:

$$ \xi_t = [\exp(j\omega_1 t), \dots, \exp(j\omega_K t)]^\top \tag{4.80} $$
with distinct frequencies, ordered for convenience: $0 < \omega_1 < \omega_2 < \cdots < \omega_K$. The dynamical system (4.79) is linear in this special case. It is ergodic, in a sense made precise in Lemma 4.37.
The following assumptions are imposed throughout the remainder of the chapter: 


(QSA1) The process $a$ is non-negative, monotonically decreasing, and
$$ \lim_{t\to\infty} a_t = 0, \qquad \int_0^\infty a_r\, dr = \infty. \tag{4.81} $$

(QSA2) The functions $\bar f$ and $f$ are Lipschitz continuous: for a constant $L_f < \infty$,

$$ \begin{aligned} \|\bar f(\theta') - \bar f(\theta)\| &\le L_f\|\theta' - \theta\|, \\ \|f(\theta', z) - f(\theta, z)\| &\le L_f\|\theta' - \theta\|, \qquad \theta', \theta \in \mathbb{R}^d,\ z \in \Omega \end{aligned} $$

There exists a Lipschitz continuous function $b_0\colon \mathbb{R}^d \to \mathbb{R}_+$, such that for all $\theta \in \mathbb{R}^d$,

$$ \Bigl\| \int_{t_0}^{t_1} \tilde f_t(\theta)\, dt \Bigr\| \le b_0(\theta), \qquad 0 \le t_0 \le t_1, \qquad \text{where } \tilde f_t(\theta) = f(\theta, \xi_t) - \bar f(\theta) \tag{4.82} $$

(QSA3) The ODE (4.42) has a globally asymptotically stable equilibrium $\theta^*$.

¹³ The “J” is omitted from the super-script in (4.78): this is to keep notation compact, and also because of Polyak’s independent work before Juditsky. See Section 4.11 for history.


The Lipschitz conditions on $\bar f$ and $f$ in (QSA2) are what you would expect if you have been exposed to theory for stochastic approximation. General sufficient conditions on both $f$ and $\xi$ for the ergodic bound (4.82) are given in Lemma 4.37.

A fourth assumption is that the QSA ODE is ultimately bounded in the following sense: there exists $b < \infty$ such that for each $\theta \in \mathbb{R}^d$ and $z \in \Omega$, there is a $T_{\theta,z}$ such that

$$ \|\Theta_t\| \le b \quad \text{for all } t \ge T_{\theta,z}, \quad \text{when } \Theta_0 = \theta,\ \xi_0 = z \tag{4.83} $$

Verification of ultimate boundedness is the subject of Section 4.9.3. For example, (4.83) can be established using a Lyapunov drift condition similar to (2.39).
The proof of Thm. 4.15 is also found in Section 4.9.


Theorem 4.15. (Boundedness Implies Convergence) Suppose that (QSA1)-(QSA3) hold, along with the ultimate boundedness assumption (4.83). Then, the solution to (4.44) converges to $\theta^*$ for each initial condition.


Coupling. The following partial integrals play a central role when we turn to rates of convergence: for $\theta \in \mathbb{R}^d$ and $T \ge 0$,

$$ \Xi^{\mathrm{I}}_T(\theta) = \int_0^T \tilde f_t(\theta)\, dt \tag{4.84} $$

This is a bounded function of $T$ under (QSA2) (recall (4.82)). The coupling to be established is expressed as the limit,
$$ \lim_{t\to\infty} \|Z_t - \Xi^{\mathrm{I}}_t(\theta^*)\| = 0 \tag{4.85} $$

This implies precise bounds on the rate of convergence of $\Theta_t$ to $\theta^*$, since by the definition (4.72),
$$ \Theta_t = \bar\Theta_t + a_t Z_t $$

Details of this theory are postponed to Section 4.9. Here we simply illustrate the conclusion with a simple example.


Figure 4.7: Evolution of $Z_t = (1+t)\tilde\Theta_t$ using Quasi Monte-Carlo estimates for a range of gains.
Consider the linear QSA ODE with vector field

$$ f(\theta, z) = A(\theta - \theta^*) + Bz, \qquad \theta \in \mathbb{R}^d,\ z \in \Omega \tag{4.86} $$
In this special case, (4.84) is independent of $\theta$:


The coupling result (4.85) is illustrated using the simple Monte-Carlo example, whose plots are shown in Fig. 4.3. The representation (4.53) is easily modified to take the form (4.86). First, denote by $\xi^0$ a periodic function of time whose sample paths define the uniform distribution on $[0, 1]$: for any continuous function $c$,

$$ \lim_{T\to\infty} \frac{1}{T}\int_0^T c(\xi^0_t)\, dt = \int_0^1 c(x)\, dx. $$

We previously used the sawtooth function, $\xi^0_t = t$ (mod 1). Introduce a gain $g > 0$, and consider

$$ \tfrac{d}{dt}\Theta_t = \frac{g}{1+t}\,[y(\xi^0_t) - \Theta_t] \tag{4.87} $$
This is of the form (4.86) with $A = -1$, $B = 1$, and $\xi_t = [y(\xi^0_t) - \theta^*]$.

Thm. 4.24 below implies the coupling result (4.85) only for $g > 1$. Figs. 4.3 and 4.4 illustrate the qualitative conclusion of Thm. 4.24. Coupling is illustrated in Fig. 4.7.

The scaled error $g^{-1}Z_t$ is compared since $\xi$ grows linearly with $g$: we expect $g^{-1}Z_t \approx \int_0^t [y(\xi^0(r)) - \theta^*]\, dr$ for large $t$.

The figure compares results using ten gains, approximately equally spaced on a logarithmic scale. The smallest gain is $g = 1.5$, and all other gains satisfy $g \ge 2$. Thm. 4.24 asserts that $\|Z_t - \Xi^{\mathrm{I}}_t\| = O([1+t]^{-\delta_g})$, where $\delta_g < 0.5$ for $g = 1.5$, and $\delta_g = 1$ for $g \ge 2$. The initial condition was set to $\Theta_0 = 10$ in each experiment. The scaled errors $\{g^{-1}Z_t : 95 \le t \le 100\}$ are nearly indistinguishable when $g \ge 2$.


4.5.5 Constant gain algorithm 


The choice $a_t \equiv \alpha$ (independent of time) is often favored in practice. For the linear model (4.86) this is not unreasonable, as we shall see in the following. The highlight is Cor. 4.17, establishing the optimal rate of convergence when using PJR averaging.

However, please be warned: the conclusions below are highly sensitive to the precise modeling assumption (4.86). Consider the minor variant, using

$$ f(\theta, z) = [A_0 + \varepsilon z A_1]\,\theta, \qquad \theta \in \mathbb{R}^d,\ z \in \mathbb{R} \tag{4.88} $$

Stability of the constant gain algorithm is characterized in [36] for this and more general linear systems with multiplicative “quasi-disturbance”. The characterizations have little resemblance to any other theory in this book. In particular, the average vector field $\bar f(\theta) = A_0\theta$ is often useless for stability analysis. See Exercise 4.18 for a short tour of this theory.

Convergence theory for vanishing gain algorithms is far more intuitive, even when subject to multiplicative disturbances as in (4.88).

Analysis for the linear model (4.86) is simplified since QSA is a time-invariant linear system: 

$$\tfrac{d}{dt}\widetilde\Theta_t = \alpha[A\widetilde\Theta_t + B\xi_t]$$

where $\widetilde\Theta_t := \Theta_t - \theta^*$ is the error at time $t$. We can solve this ODE when the probing signal is the mixture of sinusoids (4.46b), whose derivation is simplified by applying the principle of superposition. To put this to work, we restrict to the probing signal (4.46b), and for each $i$, consider the ODE 

$$\tfrac{d}{dt}\widetilde\Theta^i_t = \alpha[A\widetilde\Theta^i_t + v^i\sin(2\pi[\phi_i + \omega_i t])], \qquad \widetilde\Theta^i_0 = 0$$

The principle states that the solution to this ODE can be represented as the sum 

$$\widetilde\Theta_t = e^{\alpha A t}\widetilde\Theta_0 + B\sum_{i=1}^K \widetilde\Theta^i_t \tag{4.89}$$

The response to the initial error $\widetilde\Theta_0 = \Theta_0 - \theta^*$ decays to zero exponentially quickly. Consequently, to understand the steady-state behavior of the algorithm it suffices to fix a single value of $i$. 
For more complex probing signals we can again justify consideration of sinusoids, provided we can justify a Fourier series approximation. Let’s keep things simple, and stick to sinusoids. And it is much easier to work with complex exponentials: 

$$\tfrac{d}{dt}\widetilde\Theta_t = \alpha[A\widetilde\Theta_t + v\exp(j\omega t)], \qquad \widetilde\Theta_0 = 0$$

with $\omega\in\mathbb{R}$ and $v\in\mathbb{R}^d$ (dropping the scaling $2\pi$ and the phase $\phi$ for simplicity). We can express the solution as a convolution: 

$$\begin{aligned}\widetilde\Theta_t &= \alpha\int_0^t \exp(\alpha A r)\, v\exp(j\omega(t - r))\, dr \\ &= \alpha\Bigl(\int_0^t \exp([\alpha A - j\omega I]r)\, dr\Bigr) v\exp(j\omega t)\end{aligned}$$

Writing $D = [\alpha A - j\omega I]$, the integral of the matrix exponential is expressed, 

$$\int_0^t e^{Dr}\, dr = D^{-1}[e^{Dt} - I]$$

Using linearity once more, and the fact that the imaginary part of $e^{j\omega t}$ is $\sin(\omega t)$, we arrive at a complete representation for (4.89): 


Proposition 4.16. Consider the linear model with $A$ Hurwitz, and probing signal (4.46b), for which the constant-gain QSA ODE has the solution (4.89). Then $\widetilde\Theta^i_t = \alpha W^i_t v^i$ for each $i$ and $t$, with 

$$W^i_t = \mathrm{Im}\bigl([\alpha A - j2\pi\omega_i I]^{-1}[\exp(\alpha A t) - \exp(2\pi j[\phi_i + \omega_i t])\, I]\bigr) \tag{4.90}$$

□


Prop. 4.16 illustrates a challenge with fixed gain algorithms: if we want small steady-state error, then we require small $\alpha$ (or large $\omega_i$, but this brings other difficulties for computer implementation: never forget Euler, and the limitations imposed by a large Lipschitz constant!). However, if $\alpha > 0$ is very small, then the impact of the initial condition in (4.89) will persist for a long time. 
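The closed-form response derived above can be checked numerically in the scalar case. The sketch below follows the simplified derivation (scaling $2\pi$ and phase dropped), so the response to $v\sin(\omega t)$ is $\alpha\,\mathrm{Im}(D^{-1}[e^{\alpha A t} - e^{j\omega t}]v)$ with $D = \alpha A - j\omega$. The values of `ALPHA`, `A`, `OMEGA`, `V` are illustrative assumptions, and a hand-written Runge-Kutta scheme stands in for the exact ODE solution.

```python
import cmath
import math

ALPHA, A, OMEGA, V = 0.5, -1.0, 3.0, 1.0     # illustrative scalar values

def closed_form(t):
    """alpha * Im(D^{-1} [exp(alpha A t) - exp(j omega t)] v), D = alpha A - j omega."""
    D = ALPHA * A - 1j * OMEGA
    return ALPHA * ((math.exp(ALPHA * A * t) - cmath.exp(1j * OMEGA * t)) * V / D).imag

def rhs(t, x):
    # constant-gain error dynamics with a single sinusoidal input
    return ALPHA * (A * x + V * math.sin(OMEGA * t))

def rk4(t_end, h=1e-3):
    """Classical fourth-order Runge-Kutta from zero initial error."""
    x, t = 0.0, 0.0
    for _ in range(round(t_end / h)):
        k1 = rhs(t, x)
        k2 = rhs(t + h / 2, x + h * k1 / 2)
        k3 = rhs(t + h / 2, x + h * k2 / 2)
        k4 = rhs(t + h, x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return x

print(rk4(10.0), closed_form(10.0))
# steady-state amplitude alpha |v| / |D|: small alpha or large omega shrinks it
print(ALPHA * abs(V) / abs(ALPHA * A - 1j * OMEGA))
```

The last line makes the trade-off concrete: the amplitude of the persistent oscillation scales with $\alpha/|D|$.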

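The PJR averaging technique can be used to improve the steady-state behavior: 

Corollary 4.17. Suppose that the assumptions of Prop. 4.16 hold, so in particular $f$ is linear. Consider the averaged estimates (4.78) in which $T_0 = T - T/\kappa$ for fixed $\kappa > 1$. Then, 

$$\Theta^{\mathrm{PR}}_T = \theta^* + \frac{\kappa}{T}\Bigl(M_T\widetilde\Theta_0 + \alpha B\sum_{i=1}^K M^i_T v^i\Bigr)$$

where 

$$M_T = [\alpha A]^{-1}[\exp(\alpha A T) - \exp(\alpha A T_0)]$$

and $M^i_T$ is equal to the integral of $W^i_t$ appearing in (4.90): 

$$\begin{aligned} M^i_T = {} & \mathrm{Im}\bigl([\alpha A]^{-1}[\alpha A - j2\pi\omega_i I]^{-1}[\exp(\alpha A T) - \exp(\alpha A T_0)]\bigr) \\ & + \mathrm{Im}\Bigl(\frac{j}{2\pi\omega_i}[\alpha A - j2\pi\omega_i I]^{-1}[\exp(2\pi[\phi_i + \omega_i T]j) - \exp(2\pi[\phi_i + \omega_i T_0]j)]\Bigr)\end{aligned}$$

Hence, $\Theta^{\mathrm{PR}}_T$ converges to $\theta^*$ at rate $1/T$. □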


4.5.6 Zap QSA 


The convergence theory surveyed in Section 4.9 requires that the ODE (4.42) have a globally asymptotically stable equilibrium $\theta^*$. A tight bound on the rate of convergence requires that the linearization matrix $A^*$ is Hurwitz. 

What if $A^*$ is not Hurwitz? Or worse, what if the crucial stability assumption fails? Consider the two time-scale algorithm: 

$$\begin{aligned}\tfrac{d}{dt}\Theta_t &= \frac{1}{1+t}\,[-\widehat A_t]^{-1} f(\Theta_t, \xi_t) \\ \tfrac{d}{dt}\widehat A_t &= \frac{1}{(1+t)^\rho}\,[A_t - \widehat A_t], \qquad A_t = \partial_\theta f(\Theta_t, \xi_t)\end{aligned} \tag{4.91}$$

This is called Zap-QSA, designed to mimic the Newton-Raphson flow. 

The second ODE is introduced so that $\widehat A_t \approx A(\Theta_t) = \partial_\theta\bar f(\Theta_t)$ (following a transient). This requires $0 < \rho < 1$, meaning we use high gain for the matrix estimate. Provided we can ensure a bounded inverse, we arrive at something more closely resembling the Newton-Raphson flow: 

$$\tfrac{d}{dt}\Theta_t = \frac{1}{1+t}\,[-A(\Theta_t)]^{-1}\bigl[\bar f(\Theta_t) + \Xi_t\bigr]$$

where $\Xi_t = f(\Theta_t, \xi_t) - \bar f(\Theta_t) + \varepsilon_t$: the error $\varepsilon_t$ comes from the approximation $\widehat A_t \approx A(\Theta_t)$. 

The most important motivation for the matrix gain in (4.91) is stability, by appealing to the- 
ory for the Newton-Raphson flow. Zap QSA typically shows very fast convergence whenever the 
assumptions required for convergence of the Newton-Raphson flow are satisfied. 
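To see the stabilizing effect of the matrix gain, here is a small scalar sketch comparing Zap-QSA with plain QSA. Everything specific in it is an assumption made for illustration: the vector field $f(\theta,\xi) = A(\theta - \theta^*) + \xi$ with $A = 2$ (so the mean flow is unstable), the gains $1/(1+t)$ and $1/(1+t)^\rho$, the probing signal $\sin(t)$, and the Euler step.

```python
import math

A, THETA_STAR, RHO = 2.0, 1.0, 0.5           # A > 0: the mean flow is unstable

def f(theta, xi):
    return A * (theta - THETA_STAR) + xi

def run(zap, theta0=5.0, a_hat0=1.0, horizon=200.0, h=0.01):
    """Euler approximation of Zap-QSA (zap=True) or plain QSA (zap=False)."""
    theta, a_hat, t = theta0, a_hat0, 0.0
    for _ in range(int(horizon / h)):
        xi = math.sin(t)
        if zap:
            d_theta = -f(theta, xi) / a_hat / (1.0 + t)    # [-A_hat]^{-1} f
            a_hat += h * (A - a_hat) / (1.0 + t) ** RHO    # here d f / d theta = A
        else:
            d_theta = f(theta, xi) / (1.0 + t)
        theta += h * d_theta
        t += h
    return theta

print("zap  :", run(True))
print("plain:", run(False))
```

In this toy example the plain recursion drifts away from $\theta^*$, while the matrix gain turns the dynamics into an approximation of the Newton-Raphson flow regardless of the sign of $A$.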


4.6 Gradient-Free Optimization 


How can we find the minimum of a function without computing its gradient? There are gradient-free variants of stochastic approximation available that provide an answer. Fortunately, we do not have to wait until Chapter 8, since these algorithms can be cast within the framework of QSA theory. 

Figure 4.8: Extremum seeking control for gradient free optimization 

This section concerns the unconstrained minimization problem 


$$\min_{\theta\in\mathbb{R}^d}\Gamma(\theta). \tag{4.92}$$

It is assumed that $\Gamma\colon\mathbb{R}^d\to\mathbb{R}$ has a unique minimizer, denoted as $\theta^*$. To apply QSA techniques we relax our goal: find a solution to $f(\theta^*) = 0$, where this represents the first-order necessary condition for optimality: 

$$f(\theta) := \nabla\Gamma(\theta), \qquad \theta\in\mathbb{R}^d. \tag{4.93}$$

This is equivalent to our original objective if $\Gamma$ is convex (recall Prop. 4.6). 

The algorithms described here are based on the following architecture: 


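▲ Create rules that determine a $d$-dimensional signal $\Psi$, with $\Gamma(\Psi_t)$ measured for each $t$. 

▲ Construct an ODE of the form 

$$\tfrac{d}{dt}\Theta_t = -a_t\widetilde\nabla_{\!\Gamma}(t) \tag{4.94}$$

in which $\widetilde\nabla_{\!\Gamma}(t)$ is designed to approximate (4.93) in an average sense: 

$$\int_{T_0}^{T_1} a_t\widetilde\nabla_{\!\Gamma}(t)\, dt \approx \int_{T_0}^{T_1} a_t\nabla\Gamma(\Theta_t)\, dt, \qquad \text{for } T_1 \gg T_0 \ge 0.$$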


Terminology See the Notes section for history of gradient free optimization techniques. It is 
noted there that there are two distinct approaches. The first is strongly rooted in stochastic 
approximation theory, and commonly goes by the name Simultaneous Perturbations Stochastic 
Approximation (SPSA). The second and much older approach is called Extremum-Seeking Control 
(ESC), which is formulated in a purely deterministic setting. The similarity between SPSA and 
ESC becomes clearer when cast as QSA. 

Within the machine learning community, algorithms of the form (4.94) are known as stochas- 
tic gradient descent (SGD). This motivates the use of the terminology quasi Stochastic Gradient 
Descent (qSGD) in this book. Algorithm qSGD #1 below is a QSA interpretation of the one- 
measurement form of SPSA introduced in [329], and qSGD #2 is a very special case of ESC—to 
be explained based on the general architecture illustrated in Fig. 4.8. 


4.6.1 Simulated annealing 


The primary motivation for SPSA and ESC algorithms is that they can be run based purely on 
observations of the loss function. A secondary motivation is that the probing can be designed to 
emulate a “simulated annealing” algorithm for optimization of functions that are not convex: the 
probing can help the algorithm avoid local minima. The example described here was designed to 
illustrate this point. 


Fig. 4.9 shows a plot of a highly non-convex function, defined as the “soft min” of convex quadratic functions: 

$$\Gamma(\theta) = -\log\Bigl(\sum_{i=1}^4\exp\bigl(-\{z_i + \|\theta - \theta^i\|^2\}/\sigma^2\bigr)\Bigr)$$

with $\sigma = 1/10$, and 


(9, 2d} = {[(2), 1]. [G"),-2], ().-2] , [@), -3]} 


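Figure 4.9: A highly non-convex loss function. 

The minimizer is $\theta^* = \theta^4$ with $\Gamma(\theta^*) \approx -300$. 
The algorithm qSGD #1 is defined in (4.96) below. It was tested in this special case: 

$$\tfrac{d}{dt}\Theta_t = -a_t\frac{1}{\varepsilon}\,\xi_t\,\Gamma(\Theta_t + \varepsilon\xi_t)$$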


using ¢ = 0.15, ag = min{a, (1+ +¢)~?} with p = 0.9 and @= 10~°. The probing signal was chosen 
to satisfy (4.48): 

$$\xi_t = \sqrt{2}\,[\sin(t\omega_1), \sin(t\omega_2)]^\top$$

with $\omega_1 = 1/4$ and $\omega_2 = 1/e^2$ chosen to obtain attractive plots; higher frequencies lead to faster convergence. 

These meta-parameters were obtained by trial and error: if $\bar\alpha$ or $\varepsilon$ is too small, then we are sometimes trapped in a local minimum. 


Figure 4.10: Minimizing a loss function with multiple local minima: While $\Gamma(\Theta_t)$ is highly oscillatory, the estimate $\Theta^{\mathrm{PR}}_T$ is nearly optimal (obtained by averaging the final 20% of parameter estimates). Panels: (a) PJR Estimate: Nearly Optimal; (b) $\Theta_t$: Start of run; (c) $\Theta_t$: Final 20% of run. 


The ODE was approximated using a standard Euler scheme with 1 sec sampling interval: the crude ODE approximation led to the requirement that $\omega_1/\omega_2$ is irrational. 

The two plots on the right hand side in Fig. 4.10 show the evolution of $\Theta_t$ in $\mathbb{R}^2$ for $0 \le t \le 5\times 10^4$, with $\Theta_0 = (-2,-2)^\top$. The plots indicate that the estimates exhibit significant variation throughout the run, but in plot (c) it is clear that they are trapped within the region of attraction of the global minimum. 

What’s more, averaging is highly successful: the estimate $\Theta^{\mathrm{PR}}_T$ was obtained as the average of $\Theta_t$ over the final 20% of the run. It is found that $\Gamma(\Theta^{\mathrm{PR}}_T)$ is only a small fraction of one percent greater than $\Gamma(\theta^*)$. 


4.6.2 A menu of algorithms 


In each of the algorithms defined here it is assumed that the process $\Psi$ is the sum of two terms: $\Psi_t = \Theta_t + \varepsilon\xi_t$, $t \ge 0$, where $\varepsilon > 0$, and $\xi_t$ is a $d$-dimensional probing signal. For simplicity, we impose the normalization conditions (4.48) unless stated otherwise: 

$$\lim_{T\to\infty}\frac{1}{T}\int_0^T \xi_t\, dt = 0, \qquad \lim_{T\to\infty}\frac{1}{T}\int_0^T \xi_t\xi_t^\top\, dt = I \tag{4.95}$$

This is easily arranged using a mixture of sinusoids. 
The first algorithm is the simplest: 

qSGD #1 
For a given $d\times d$ positive definite matrix $G$, and $\Theta_0\in\mathbb{R}^d$, 

$$\tfrac{d}{dt}\Theta_t = -a_t\frac{1}{\varepsilon}\,G\,\xi_t\,\Gamma(\Psi_t) \tag{4.96a}$$
$$\Psi_t = \Theta_t + \varepsilon\xi_t \tag{4.96b}$$

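This algorithm takes the form (4.44), with 

$$f(\theta, \xi_t) = -\frac{1}{\varepsilon}\,G\,\xi_t\,\Gamma(\theta + \varepsilon\xi_t) \tag{4.97}$$

If $\Gamma\colon\mathbb{R}^d\to\mathbb{R}$ is twice continuously differentiable, then a second-order Taylor expansion of the objective function gives 

$$\Gamma(\theta + \varepsilon\xi_t) = \Gamma(\theta) + \varepsilon\xi_t^\top\nabla\Gamma(\theta) + \tfrac12\varepsilon^2\xi_t^\top\nabla^2\Gamma(\theta)\,\xi_t + o(\varepsilon^2)$$

and hence 

$$f(\theta, \xi_t) = -\frac{1}{\varepsilon}\Gamma(\theta)\,G\xi_t - G\xi_t\xi_t^\top\nabla\Gamma(\theta) + O(\varepsilon)$$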


where the error notation $o(\cdot)$ and $O(\cdot)$ is reviewed in Appendix A. Under (4.95) this implies the following approximation for the averaged vector field: 

$$\bar f(\theta) = \lim_{T\to\infty}\frac{1}{T}\int_{t=0}^T f(\theta, \xi_t)\, dt = -G\nabla\Gamma(\theta) + O(\varepsilon) \tag{4.98}$$
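The approximation (4.98) can be tested numerically by averaging the vector field (4.97) over a long window with $\theta$ frozen. This is only a sketch under stated assumptions: a hypothetical quadratic objective, $G = I$, a two-frequency sinusoidal probing signal scaled to satisfy (4.95), and a Riemann sum in place of the limit.

```python
import math

Q = (1.0, 2.0)                     # hypothetical objective: Gamma = 0.5 * sum q_i theta_i^2
EPS = 0.5
OMEGA = (1.0, math.sqrt(2.0))      # probing frequencies, chosen for illustration

def gamma(th):
    return 0.5 * sum(q * x * x for q, x in zip(Q, th))

def probe(t):
    return [math.sqrt(2.0) * math.sin(w * t) for w in OMEGA]

def f(th, xi):
    """qSGD #1 vector field (4.97) with G = I."""
    val = gamma([x + EPS * s for x, s in zip(th, xi)])
    return [-s * val / EPS for s in xi]

def averaged_field(th, horizon=2000.0, dt=0.01):
    """Riemann-sum approximation of the time average of f(th, xi_t)."""
    n = int(horizon / dt)
    acc = [0.0, 0.0]
    for k in range(n):
        fk = f(th, probe(k * dt))
        acc = [a + b for a, b in zip(acc, fk)]
    return [a / n for a in acc]

theta = [0.5, -0.5]
fbar = averaged_field(theta)
minus_grad = [-q * x for q, x in zip(Q, theta)]
print(fbar, minus_grad)            # the two vectors should be close
```

Note how the large term $-\varepsilon^{-1}\Gamma(\theta)G\xi_t$ averages out only because the probing signal has zero mean.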


QSA theory predicts that qSGD #1 will approximate the steepest descent algorithm for the choice $G = I$. 

The next algorithm requires differentiation of measurements. It is motivated by the representation: 

$$\tfrac{d}{dt}\Gamma(\Psi_t) = [\Psi'_t]^\top\nabla\Gamma(\Psi_t)$$

where the ‘prime’ denotes derivative: $\Psi'_t = \tfrac{d}{dt}\Psi_t$. 


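qSGD #2 
For a given $d\times d$ positive definite matrix $G$, and initial condition $\Theta_0$, 

$$\tfrac{d}{dt}\Theta_t = -a_t\frac{1}{\varepsilon}\,G\,\xi'_t\,\tfrac{d}{dt}\Gamma(\Psi_t) \tag{4.99a}$$
$$\Psi_t = \Theta_t + \varepsilon\xi_t \tag{4.99b}$$

This algorithm fits neatly in the ESC architecture illustrated in Fig. 4.8, in which the HP (high pass) filters are chosen to be pure differentiation: 

$$\widetilde\nabla_{\!\Gamma}(t) = \frac{1}{\varepsilon}\,\xi'_t\times\tfrac{d}{dt}\Gamma(\Psi_t)$$

Algorithm (4.99) can also be cast as a QSA ODE, in which we view $\xi^\circ_t := (\xi_t, \xi'_t)\in\mathbb{R}^{2d}$ as the exploration signal. Equation (4.99a) gives 

$$f(\theta, \xi^\circ_t) = -G\,\xi'_t\{\xi'_t\}^\top\nabla\Gamma(\theta + \varepsilon\xi_t)$$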


Analysis of each of these two algorithms presents challenges. A challenge with qSGD #2 is differentiation of the observations $\{\Gamma(\Psi_t)\}$. This is motivation for replacing differentiation with a high-pass filter. 

For qSGD #1 the challenge is presented by the form of $f$ in (4.97). In many problems we know that $\nabla\Gamma$ is globally Lipschitz continuous, but $\Gamma$ is not. In such cases $f$ is not Lipschitz continuous in $\theta$, even though Lipschitz continuity will be a standing assumption in the theory. This is not an enormous challenge, since we can modify the algorithm to obtain convergence (say, using a projection of parameter estimates onto a bounded set). 

Lipschitz continuity is easily established for the next algorithm. 


qSGD #3 


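For a given $d\times d$ positive definite matrix $G$, and initial condition $\Theta_0$, 

$$\tfrac{d}{dt}\Theta_t = -a_t\frac{1}{2\varepsilon}\,G\,\xi_t\,\{\Gamma(\Theta_t + \varepsilon\xi_t) - \Gamma(\Theta_t - \varepsilon\xi_t)\} \tag{4.100}$$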


Denoting by $a_t f(\Theta_t, \xi_t)$ the right hand side of (4.100), we can show that $f$ is Lipschitz in its first variable whenever $\nabla\Gamma$ is Lipschitz, and in this case $f(\theta,\xi) = -G\xi\xi^\top\nabla\Gamma(\theta) + O(\varepsilon)$. 
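A minimal Euler sketch of qSGD #3 follows. The objective is a hypothetical strongly convex quadratic with minimizer `C`, and $G = I$, the gain $a_t = 1/(1+t)$, the probing frequencies, and the step size are all assumptions chosen for illustration. For a quadratic the two-measurement difference is exactly $2\varepsilon\,\xi_t^\top\nabla\Gamma(\Theta_t)$, so no bias in $\varepsilon$ is expected in this special case.

```python
import math

Q = (2.0, 4.0)                  # hypothetical strongly convex quadratic
C = (1.0, -1.0)                 # its minimizer theta*
EPS = 0.1
OMEGA = (1.0, math.sqrt(2.0))

def gamma(th):
    return 0.5 * sum(q * (x - c) ** 2 for q, x, c in zip(Q, th, C))

def qsgd3(theta0, horizon=1000.0, h=0.02):
    """Euler approximation of (4.100) with G = I and a_t = 1/(1+t)."""
    th, t = list(theta0), 0.0
    for _ in range(int(horizon / h)):
        xi = [math.sqrt(2.0) * math.sin(w * t) for w in OMEGA]
        plus = gamma([x + EPS * s for x, s in zip(th, xi)])
        minus = gamma([x - EPS * s for x, s in zip(th, xi)])
        gain = 1.0 / (1.0 + t)
        th = [x - h * gain * s * (plus - minus) / (2.0 * EPS) for x, s in zip(th, xi)]
        t += h
    return th

estimate = qsgd3([3.0, 2.0])
print(estimate)                 # should be close to C
```

Only function values of `gamma` are used, never its gradient, which is the point of the construction.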

The mean vector field $\bar f_\varepsilon$ for (4.100) admits the same approximation (4.98) under slightly milder conditions on the probing signal, with the zero-mean assumption (4.48a) dropped. Moreover, the following global consistency result holds: 

Proposition 4.18. Suppose that the following hold for the function $\Gamma$ and algorithm parameters in qSGD #3: 

(i) Assumption (QSA1) holds. 

(ii) The probing signal satisfies (4.48b). 

(iii) $\nabla\Gamma$ is globally Lipschitz continuous, and $\Gamma$ is strongly convex (review Section 4.4.1 for definitions), with unique minimizer $\theta^*\in\mathbb{R}^d$. 

Then there exists $\bar\varepsilon > 0$ such that for each $\varepsilon\in(0,\bar\varepsilon)$, there is a unique root $\theta^*_\varepsilon$ of $\bar f_\varepsilon$, satisfying $\|\theta^*_\varepsilon - \theta^*\| \le O(\varepsilon)$. Moreover, convergence holds from each initial condition: $\lim_{t\to\infty}\Theta_t = \theta^*_\varepsilon$. □


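Proof. The hypotheses of the proposition imply that Assumptions (QSA1)-(QSA2) of Section 4.5.4 hold for $f(\theta,\xi) = -G\xi\xi^\top\nabla\Gamma(\theta) + O(\varepsilon)$ and $\bar f_\varepsilon$ defined in (4.98). Since $\Gamma$ is strongly convex, it holds that there is $\varepsilon_0 > 0$ such that there is a unique solution to $G\nabla\Gamma(\theta) = z$ whenever $\|z\| \le \varepsilon_0$, from which Assumption (QSA3) may be established for $\varepsilon > 0$ sufficiently small. Thm. 4.15 then implies that for each such $\varepsilon$, $\Theta_t$ converges to the unique root $\theta^*_\varepsilon$ of $\bar f_\varepsilon$ satisfying $\|\nabla\Gamma(\theta^*_\varepsilon)\| = O(\varepsilon)$. Due to strong convexity, we have: 

$$\Gamma(\theta^*) \ge \Gamma(\theta^*_\varepsilon) + (\nabla\Gamma(\theta^*_\varepsilon))^\top(\theta^* - \theta^*_\varepsilon) + \frac{\eta}{2}\|\theta^*_\varepsilon - \theta^*\|^2$$

for some $\eta > 0$. Therefore 

$$\begin{aligned}\frac{\eta}{2}\|\theta^*_\varepsilon - \theta^*\|^2 &\le \Gamma(\theta^*) - \Gamma(\theta^*_\varepsilon) + (\nabla\Gamma(\theta^*_\varepsilon))^\top(\theta^*_\varepsilon - \theta^*) \\ &\le \|\nabla\Gamma(\theta^*_\varepsilon)\|\,\|\theta^*_\varepsilon - \theta^*\|\end{aligned}$$

where the second inequality uses $\Gamma(\theta^*) \le \Gamma(\theta^*_\varepsilon)$ and the Cauchy-Schwarz inequality, implying that $\|\theta^*_\varepsilon - \theta^*\| \le O(\varepsilon)$. □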


4.7 Quasi Policy Gradient Algorithms 


It is not difficult to apply these techniques to the “tamer” examples considered in Chapter 2. 


4.7.1 Mountain Car 


We return to the example introduced in Section 2.7.2, and the simple policy (2.59). This cannot be optimal since it is clear the state will remain near $z^{\min}$ far longer than necessary from certain initial conditions. A more sensible policy will avoid this “western frontier”. Here is one suggestion, based on a threshold $\theta$ in the interval $[z^{\min}, z^{\mathrm{goal}}]$. 

With $z(k) = x_1(k)$ and $v(k) = x_2(k)$ denoting position and velocity at time-step $k$, consider 

$$u(k) = \phi^\theta(x(k)) = \begin{cases} 1 & \text{if } z(k) \le \theta \\ \operatorname{sign}(v(k)) & \text{else}\end{cases} \tag{4.101}$$


The policy “panics” and accelerates the car towards the goal whenever $z(k)$ is at or below the threshold $\theta$. 
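The threshold policy is a one-liner in code. The sketch below is a direct transcription; the function name and the convention `sign(0) = 0` are assumptions made here for illustration.

```python
def threshold_policy(z, v, theta):
    """Input u(k) for position z, velocity v, and threshold theta."""
    if z <= theta:
        return 1.0                       # "panic": accelerate toward the goal
    return float((v > 0) - (v < 0))      # sign(v); sign(0) = 0 is an assumed convention

print(threshold_policy(-0.9, -0.01, theta=-0.8))   # at or below threshold
print(threshold_policy(-0.5, -0.01, theta=-0.8))   # above threshold, moving left
```

The scalar `theta` is the only tunable parameter, which is what makes this policy a convenient test case for the gradient-free algorithms.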
Figure 4.11: Trajectories for the Mountain Car for two policies, and three initial conditions. 


The range of acceptable $\theta$ can be estimated by examining the graph of potential energy shown in Fig. 2.11, in the case of static input $u(k) \equiv 1$. Its minimum is at $z^\circ \approx -0.48$. Denoting $v(1) = F_2(x, u)$, we have by definition of $z^\circ$, 


0 = £Y1 (2°) = mgsin(9(z°)) —K=v(0)—-v(1), w= 1, = (2°,0)7 


That is, $v(1) = v(0) = 0$, which implies that the Mountain Car has stalled: $x(1) = x(0)$. We therefore cannot allow a policy for which $\phi(x^\circ) = 1$, which means that the policy $\phi^\theta$ is not acceptable if $\theta \ge z^\circ$. We don’t observe infinite total cost in experiments that follow because we artificially bound the value function, as explained below. 

Fig. 4.11 shows trajectories of position as a function of time from three initial conditions, each with $v(0) = 0$, and with two instances of this policy: $\theta = -0.8$, and $\theta = -0.2$. The former is a much better choice from initial position $z(0) = -0.6$: we see that the time to reach the goal is nearly twice as long when using $\theta = -0.2$ as compared to $\theta = -0.8$. 

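Let’s see how to adapt qSGD #1 to find the optimal value of $\theta$. A discrete-time approximation of the recursion (4.96) is 

$$\Theta_{n+1} = \Theta_n - \alpha_{n+1}\frac{1}{\varepsilon}\,G\,\xi_{n+1}\,\Gamma(\Psi_{n+1}) \tag{4.102a}$$
$$\Psi_{n+1} = \Theta_n + \varepsilon\xi_{n+1} \tag{4.102b}$$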

The question is, how do we define $\Gamma$? 

The total cost in this example coincides with the time to reach the goal. For fixed initial condition $x_0\in\mathbb{R}^2$, we might estimate the minimum of the corresponding total cost $J_\theta(x_0)$ over $\theta$. A natural approach is episodic: at stage $n$ of the algorithm, we initialize the car at state $x_0$, and run the policy $\phi^\theta$ using $\theta = \Psi_{n+1} = \Theta_n + \varepsilon\xi_{n+1}$. On reaching the goal state we have a measurement of $J_\theta(x_0)$ for this value of $\theta$. 

We make two modifications to this objective function. First, because we don’t know if the policy is stabilizing for all $\theta$, we introduce a maximal value $J^{\max}$ of our choosing. Second, we are interested in more than one initial condition. Let $\nu$ denote a pmf on $\mathsf{X}$, and define as our objective function 

$$\Gamma(\theta) = \sum_i \nu(x^i)\min\{J^{\max}, J_\theta(x^i)\}$$

where the sum is over the support of $\nu$. If $\nu$ has $K$ points of support, then $K$ experiments are required to obtain the measurement $\Gamma(\Psi_{n+1})$ at stage $n+1$ of the algorithm. 
The value $J^{\max} = 5{,}000$ was used in all of the numerical experiments that follow. 

Figure 4.12: qSGD #1 for Mountain Car: the gradient-free optimization algorithm (4.102) using a large constant step-size. 


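A second option is to introduce a second probing signal to generate a quasi-random sequence of initial conditions $\{x_0^n : n \ge 0\}$ that admit an ergodic average: for any subset $S\subset\mathsf{X}$, 

$$\nu(S) = \lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^N \mathbb{1}\{x_0^n\in S\}$$

In this case we don’t require that $\nu$ have finite support, so our objective requires the general notation, 

$$\Gamma(\theta) = \int\min\{J^{\max}, J_\theta(x)\}\,\nu(dx) \tag{4.103}$$

In this case the recursion (4.102a) is modified as follows: 

$$\begin{aligned}\Theta_{n+1} &= \Theta_n - \alpha_{n+1}\frac{1}{\varepsilon}\,G\,\xi_{n+1}\,\Gamma_{n+1} \\ \Gamma_{n+1} &= \min\{J^{\max}, J_\theta(x_0^{n+1})\}\big|_{\theta = \Psi_{n+1}}\end{aligned} \tag{4.104}$$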

Figure 4.13: qSGD for Mountain Car: (a) qSGD #1 (b) qSGD #2 implemented using eq. (4.107). Panel (a): objective function $\mathsf{E}[J_\theta(X)]$ (average cost, $X$ uniformly distributed), and typical behavior of estimates. Panel (b): histogram obtained from independent runs (with Gaussian approximation), and typical behavior of estimates. 


The experiments that follow are based on (4.104) (option 2), though the first option (using $K$ experiments per iteration of the algorithm) is most likely preferable in terms of accelerating convergence. The sequence $\{x_0^n = (z_0^n, v_0^n)^\top\}$ was chosen to cover the state space uniformly. This was achieved by introducing a second probing signal $\xi^x_n = (\xi^z_n, \xi^v_n)$ with 

$$\xi^z_n = \mathrm{frac}(n r_z), \qquad \xi^v_n = \mathrm{frac}(n r_v),$$

where “frac” denotes the fractional part of a real number, $r_z, r_v$ are irrational, and their ratio is also irrational. The values $r_z = \pi$ and $r_v = e$ were chosen in all experiments. Then define 

up =(2ER—1) and ah = 2 + [zt — 2 ER (105) 
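The quasi-random initial conditions are simple to generate. The sketch below produces the two fractional-part sequences and maps one of them onto a position interval. The limits `Z_MIN` and `Z_GOAL` are hypothetical placeholders, and pairing $\pi$ with position is an assumption; only the use of `frac` with irrational multipliers is taken from the text.

```python
import math

Z_MIN, Z_GOAL = -1.2, 0.5       # hypothetical position limits, for illustration only

def frac(x):
    """Fractional part of a real number."""
    return x - math.floor(x)

def initial_position(n, r_z=math.pi):
    """Map the probing value frac(n * r_z) in [0, 1) onto [Z_MIN, Z_GOAL)."""
    return Z_MIN + (Z_GOAL - Z_MIN) * frac(n * r_z)

N = 10000
xi_z = [frac(n * math.pi) for n in range(1, N + 1)]
xi_v = [frac(n * math.e) for n in range(1, N + 1)]
print(sum(xi_z) / N, sum(xi_v) / N)     # both averages should be close to 1/2
```

An empirical mean near one half is a crude but quick check that the sequence covers the unit interval evenly.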


Fig. 4.12 shows one run using the constant step-size $\alpha_n \equiv 0.1$, $\varepsilon = 0.05$, and $\xi_n = \sin(n)$. The large fixed step-size was chosen simply to illustrate the exotic nonlinear dynamics that emerge from this algorithm. It would seem that the algorithm has failed, since the estimates oscillate between $-1.2$ and $-0.3$ in steady-state, while the actual optimizer is $\theta^* \approx -0.8$. The dashed line shows the average of $\{\Theta_n\}$ over the final 20% of estimates. This average is very nearly optimal, since the objective function is nearly flat for $\theta$ near the optimizer. 

Fig. 4.13 (a) shows results from an experiment with decaying step-size a, = 1/n°”°, small 
€ > 0, and a minor change in the probing signal: 


$$\xi_n = \sin(2\pi\phi + n) \tag{4.106}$$

The phase variable $\phi$ was selected uniformly at random in $[0,1]$ when repeated experiments were performed. The upper plot in Fig. 4.13 (a) shows the average cost (4.103), with $N = 10^4$ for a range of $\theta$. The value of $\theta^*$ was obtained by computing the minimum of this function. This approach to estimate the optimal threshold is simpler and more reliable than QSA techniques! Brute-force methods make sense if the dimension of $\theta$ is one or two; in complex situations we need a more clever search strategy. 

Fig. 1.2 shows results from 10° independent runs, each with horizon length T = 10*. In each 
case, the parameter estimates evolve according to (4.104), to obtain estimates {O4,: 1 <n < 
T, 1<i< 10%}. The two columns are distinguished by the probing signals. For QSA the probing 
signal & was a sinusoid, with phase ¢ selected independently in the interval [0, 1), in each of the 10° 
runs. The results displayed in the second column are based on an experiment with random probing 
signal, each uniform and independent on its respective range. The distribution of $\xi_n$ was chosen uniform on the interval $[-1,1]$ for each $n$. The label “1SPSA” refers to Spall’s single observation algorithm based on random (i.i.d.) exploration (see the Notes section at the end of this chapter for history). 

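qSGD #2 is easily adapted to this application: 

$$\Theta_{n+1} = \Theta_n - \alpha_{n+1}\frac{1}{\varepsilon}\,G\,\xi'_{n+1}\,\Gamma'_{n+1} \tag{4.107a}$$
$$\Psi_{n+1} = \Theta_n + \varepsilon\xi_{n+1} \tag{4.107b}$$

where the primes denote approximations of the derivatives appearing in (4.99a): 

$$\xi'_{n+1} = \frac{1}{\delta}(\xi_{n+1} - \xi_n), \qquad \Gamma'_{n+1} = \frac{1}{\delta}(\Gamma_{n+1} - \Gamma_n)$$

where $\delta > 0$ is the sampling interval, and $\Gamma_{n+1}$ is defined by (4.104) if using option 2. 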

A histogram and sample path of parameter estimates are shown in Fig. 4.13 (b), based on 
algorithm (4.107) with 6? = 0.5, and all of the same choices for parameters, except that the 
step-size was reduced to avoid large initial transients: a, = min(1/n°",0.05). This results in 
Ge. = ine’ torn 55. 

Based on the histogram, the performance appears slightly worse than observed for qSGD #1 in Fig. 1.2, but these outcomes are a product of particular choices for algorithm parameters. 


4.7.2 LQR 


The Linear Quadratic Regulator (LQR) optimal control problem was introduced in Section 3.6 for models in discrete time, and revisited in Section 3.9.4 for the continuous-time model, 

$$\tfrac{d}{dt}x = Ax + Bu$$

The value function is defined as the minimum total cost 

$$J^*(x) = \min_{u}\int_0^\infty c(x_t, u_t)\, dt, \qquad x_0 = x\in\mathsf{X}$$

with quadratic cost (3.3): $c(x,u) = x^\top Sx + u^\top Ru$. If finite valued, then this value function is quadratic, $J^*(x) = x^\top M^* x$, with matrix $M^* \ge 0$ solving the algebraic Riccati equation (3.61), and the optimal input is linear state feedback: 

$$u^*_t = -K^* x^*_t = -R^{-1}B^\top M^* x^*_t$$


For simplicity we restrict to the single-input model, so that K* is a row vector (1 x n). 

If we do not have a model available, then we can approximate the optimal policy by first 
estimating (A,B) through system identification, and then proceed to solve the ARE to obtain 
the estimate of the optimal gain. Alternatively, we can use gradient-free methods to estimate $K^*$ directly, identifying a feedback gain $K$ with $\theta^\top$. One great benefit of the latter approach is that we are free to impose structure on the gain matrix. For example, for a system with $n$ states and $n$ inputs, we might search for the $n\times n$ gain matrix that is optimal over all diagonal matrices. This means we consider $u_t(i) = -\theta_i x_t(i)$ for each $i$ and $t$, and optimize over $\theta\in\mathbb{R}^n$. Exercise 4.17 provides a particular example of this flavor on which to test qSGD. 

Consider the unstructured problem with $u_t = -\theta^\top x_t$, in which the goal is to minimize the objective function 

$$\Gamma(\theta) = \sum_i \nu(x_0^i)\, J_\theta(x_0^i)$$

where $\nu$ is a pmf on the state space, and $J_\theta$ is the infinite-horizon cost with feedback law determined by $\theta$: 

$$J_\theta(x_0) = \int_0^\infty c(x_t, u_t)\, dt, \qquad x_0\in\mathbb{R}^n$$

To apply qSGD it would be necessary to approximate by the finite-horizon objective 

$$J_\theta(x_0) = \int_0^T c(x_t, u_t)\, dt + x_T^\top S_0 x_T$$

with $S_0 > 0$ included to encourage a stable control solution. 
Before considering qSGD, we should first see if the gradient flow will be successful: 

$$\tfrac{d}{dt}\vartheta_t = -\nabla\Gamma(\vartheta_t)$$

It is simplest to return to the infinite-horizon setting, so that $T = \infty$ and $S_0 = 0$. Analysis is based on the correlation matrix 

$$\Sigma_0 = \sum_i \nu(x_0^i)\, x_0^i\{x_0^i\}^\top$$

and the solution $X_\theta$ to the Lyapunov equation 

$$(A - B\theta^\top)X_\theta + X_\theta(A - B\theta^\top)^\top + \Sigma_0 = 0$$

from which we obtain $\Gamma(\theta) = \mathrm{trace}(SX_\theta) + R\,\theta^\top X_\theta\theta$. 

We begin with bad news: it is known that $\Gamma$ may be non-convex in $\theta$. An example in [78] shows that the domain on which $\Gamma(\theta)$ is finite is not convex, but this paper also brings good news. It has been known since the 1980s that $\nabla\Gamma(\theta)$ vanishes only at $\theta = \theta^*$, and that $V(\theta) = \Gamma(\theta) - \Gamma(\theta^*)$ is coercive (even though it is not everywhere finite-valued). Hence this serves as a Lyapunov function: 

$$\tfrac{d}{dt}V(\vartheta_t) = -\|\nabla\Gamma(\vartheta_t)\|^2 < 0, \qquad \text{whenever } \vartheta_t \ne \theta^*$$

In conclusion, based on this theory we can expect success using qSGD provided we project parameter estimates (possibly required since neither $\Gamma$ nor its gradient is Lipschitz continuous). 
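In the scalar case everything above is available in closed form, which makes for a quick sanity check of the gradient flow. The sketch below assumes hypothetical model data `A, B, S, R, SIGMA0` (all equal to one), solves the scalar Lyapunov equation by hand, and uses a central difference for the gradient. The reference value `theta_star` comes from setting the derivative of the scalar objective to zero, which agrees with the scalar Riccati solution.

```python
import math

A, B, S, R, SIGMA0 = 1.0, 1.0, 1.0, 1.0, 1.0    # hypothetical scalar model data

def gamma(theta):
    """Gamma(theta) = (S + R theta^2) X_theta, where 2 (A - B theta) X_theta + SIGMA0 = 0."""
    a_cl = A - B * theta
    if a_cl >= 0:
        return math.inf                  # closed loop not stable: infinite cost
    x_theta = SIGMA0 / (-2.0 * a_cl)
    return (S + R * theta * theta) * x_theta

def grad_gamma(theta, d=1e-6):
    return (gamma(theta + d) - gamma(theta - d)) / (2.0 * d)   # central difference

def gradient_flow(theta0, horizon=50.0, h=0.01):
    """Euler approximation of d/dt theta = -grad Gamma(theta)."""
    theta = theta0
    for _ in range(int(horizon / h)):
        theta -= h * grad_gamma(theta)
    return theta

theta_star = (A + math.sqrt(A * A + B * B * S / R)) / B
print(gradient_flow(3.0), theta_star)
```

The guard returning `math.inf` reflects the fact that $\Gamma$ is finite only on the set of stabilizing gains, which is why a projection may be needed in practice.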


4.7.3. What about high dimensions? 


The qSGD algorithms are easy to code, and quickly converge to an approximately optimal parameter for the examples considered in this section. In high dimensions we can’t expect to blindly apply any of these algorithms. For example, consider the choice of probing signal (4.47), where $i$ ranges from 1 to $d = 1000$. If the frequencies $\{\omega_i\}$ are chosen in a narrow range, then the limit (4.48b) will converge very slowly. The rate will be faster if the frequencies are widely separated, but we then need a much higher resolution ODE approximation to implement an algorithm. 

This challenge is well understood in the optimization literature. One approach to create a 
reliable algorithm is to employ block coordinate descent. This requires two ingredients: 


(i) A sequence of timepoints $T_0 = 0 < T_1 < T_2 < \cdots$ 

(ii) A sequence of “parameter blocks” $B_k\subset\{1,\dots,d\}$ for each $k \ge 0$, where the number of elements $d_k$ in $B_k$ is far smaller than $d$. 

The qSGD ODE (4.94) is modified so that $\Theta_t(i)$ is held constant on the interval $[T_k, T_{k+1})$ for $i\notin B_k$, and 

$$\tfrac{d}{dt}\Theta_t(i) = -a_t[\widetilde\nabla_{\!\Gamma}(t)]_i, \qquad i\in B_k,\ t\in[T_k, T_{k+1})$$

See [73, 282, 19] for this and more sophisticated approaches. 
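The block structure is perhaps easiest to understand in code. The sketch below cycles through three blocks of a six-dimensional parameter, freezing all coordinates outside the active block. To keep it short, the exact partial derivative of a hypothetical separable quadratic stands in for the gradient estimate, and the constant gain, block schedule, and step size are assumptions.

```python
Q = [1.0, 2.0, 3.0, 1.5, 2.5, 4.0]      # hypothetical objective: 0.5 * sum q_i (theta_i - c_i)^2
C = [1.0, -1.0, 0.5, 2.0, 0.0, -2.0]
BLOCKS = [(0, 1), (2, 3), (4, 5)]       # parameter blocks, visited cyclically
SWITCH = 1.0                            # length of each interval [T_k, T_{k+1})

def grad(theta, i):
    return Q[i] * (theta[i] - C[i])     # exact partial derivative, standing in for the estimate

def block_descent(theta0, horizon=60.0, h=0.01, gain=1.0):
    """Euler approximation of the block-coordinate ODE with constant gain."""
    theta = list(theta0)
    for n in range(int(horizon / h)):
        k = int(n * h / SWITCH)                      # index of the current interval
        for i in BLOCKS[k % len(BLOCKS)]:
            theta[i] -= h * gain * grad(theta, i)    # coordinates outside the block are frozen
    return theta

result = block_descent([0.0] * 6)
print(result)
```

With a probing-based gradient estimate, only the coordinates in the active block would need probing signals on each interval, which is the source of the savings in high dimensions.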


4.8 Stability of ODEs* 


The asterisk on this section title indicates that it contains advanced material. The stability theory here is needed if you want to fully understand why the ODE methods surveyed in this chapter are “well behaved”, and the concepts will be extended to QSA in the following section. 
We begin with the proof of a central result and a simple corollary. 


4.8.1 Grönwall’s Inequality 

Proof of Grönwall’s Inequality, Prop. 4.2. Consider first the simpler equality (4.7): 

$$z_t = \gamma_t + \int_0^t \beta_s z_s\, ds$$

Observe that $\gamma$ is continuous, under the assumption that $z$ and $\beta$ are continuous. We can “solve” this equation through the construction of a state space model with state $x = z - \gamma$, and output $z$. From the integral equation 

$$x_t = \int_0^t \beta_s[x_s + \gamma_s]\, ds$$

we obtain the time varying linear state space model 

$$\tfrac{d}{dt}x_t = \beta_t x_t + \beta_t\gamma_t, \qquad z_t = x_t + \gamma_t$$

with initial condition $x_0 = z_0 - \gamma_0 = 0$. This scalar linear state space model, with “input” $u_t = \beta_t\gamma_t$, has an explicit solution: 

$$x_t = \int_0^t u_s\exp\Bigl(\int_s^t \beta_r\, dr\Bigr)\, ds$$

We next turn to the inequality (4.6a), which we can write as 

$$z_t = \alpha_t + \int_0^t \beta_s z_s\, ds - \delta_t$$

where $\delta_t \ge 0$ for each $t$. Define $\gamma_t = \alpha_t - \delta_t$ and obtain, with $u_t = \beta_t\gamma_t$ as before, 

$$x_t = \int_0^t \gamma_s\beta_s\exp\Bigl(\int_s^t \beta_r\, dr\Bigr)\, ds \le \int_0^t \alpha_s\beta_s\exp\Bigl(\int_s^t \beta_r\, dr\Bigr)\, ds$$

Using $z_t = x_t + \gamma_t \le x_t + \alpha_t$ then gives (i). 
(ii) If the function $\alpha$ is non-decreasing, then from part (i) and the assumption that $\beta$ is non-negative, 

$$z_t \le \alpha_t + \alpha_t\int_0^t \beta_s\exp\Bigl(\int_s^t \beta_r\, dr\Bigr)\, ds, \qquad 0 \le t \le T.$$

This bound implies (ii) on substituting 

$$\int_0^t \beta_s\exp\Bigl(\int_s^t \beta_r\, dr\Bigr)\, ds = \exp\Bigl(\int_0^t \beta_r\, dr\Bigr) - 1$$

□
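The substitution used in the final step of the proof follows from the fundamental theorem of calculus. A short derivation, using only the notation of the proof:

```latex
\frac{\partial}{\partial s}\exp\Bigl(\int_s^t \beta_r\,dr\Bigr)
   = -\beta_s\exp\Bigl(\int_s^t \beta_r\,dr\Bigr)
\quad\Longrightarrow\quad
\int_0^t \beta_s\exp\Bigl(\int_s^t \beta_r\,dr\Bigr)\,ds
   = \Bigl[-\exp\Bigl(\int_s^t \beta_r\,dr\Bigr)\Bigr]_{s=0}^{s=t}
   = \exp\Bigl(\int_0^t \beta_r\,dr\Bigr) - 1 .
```

Combining this with the preceding bound gives $z_t \le \alpha_t + \alpha_t\bigl[\exp\bigl(\int_0^t \beta_r\,dr\bigr) - 1\bigr] = \alpha_t\exp\bigl(\int_0^t \beta_r\,dr\bigr)$.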
Grönwall’s Inequality implies a crude bound that is needed in approximations: 

Proposition 4.19. Consider the ODE (4.2), subject to the Lipschitz condition (4.5). Then, 

(i) There is a constant $B_f$ depending only on $f$ such that 

$$\|\vartheta_t\| \le (B_f + \|\vartheta_0\|)e^{Lt} - B_f \tag{4.108a}$$
$$\|\vartheta_t - \vartheta_0\| \le (B_f + L\|\vartheta_0\|)\,t e^{Lt}, \qquad t \ge 0 \tag{4.108b}$$

(ii) If there is an equilibrium $\theta^*$, then for each initial condition, 

$$\|\vartheta_t - \theta^*\| \le \|\vartheta_0 - \theta^*\|e^{Lt}, \qquad t \ge 0$$

Proof. We present a complete proof of (ii) and (4.108a) (the proof of (4.108b) is similar). 
If there is an equilibrium $\theta^*$, this means that $f(\theta^*) = 0$. The proof of (ii) then begins with (4.3), in the form 

$$\vartheta_t - \theta^* = \vartheta_0 - \theta^* + \int_0^t f(\vartheta_r)\, dr, \qquad 0 \le t \le T$$

Under the equilibrium condition and the Lipschitz assumption, 

$$\|f(\vartheta_r)\| = \|f(\vartheta_r) - f(\theta^*)\| \le L\|\vartheta_r - \theta^*\|$$

Writing $z_t = \|\vartheta_t - \theta^*\|$, this bound combined with (4.3) gives 

$$z_t \le z_0 + L\int_0^t z_r\, dr, \qquad 0 \le t \le T$$

Grönwall’s Inequality then gives (ii): apply Prop. 4.2 (ii) using $\beta_t \equiv L$ and $\alpha_t \equiv z_0$. 
To establish (4.108a), take any $\theta^\circ\in\mathbb{R}^d$ and use the Lipschitz condition to obtain 

$$\begin{aligned}\|f(\theta)\| &\le \|f(\theta) - f(\theta^\circ)\| + \|f(\theta^\circ)\| \\ &\le L\|\theta - \theta^\circ\| + \|f(\theta^\circ)\| \\ &\le L\|\theta\| + L\|\theta^\circ\| + \|f(\theta^\circ)\|\end{aligned}$$

With $\theta^\circ$ fixed, define $B_f = [L\|\theta^\circ\| + \|f(\theta^\circ)\|]/L$, so that 

$$\|f(\theta)\| \le L[\|\theta\| + B_f], \qquad \theta\in\mathbb{R}^d$$

Applying (4.3) then gives 

$$\|\vartheta_t\| + B_f \le \|\vartheta_0\| + B_f + L\int_0^t [\|\vartheta_r\| + B_f]\, dr$$

Grönwall’s Inequality establishes (4.108a), using $z_t = \|\vartheta_t\| + B_f$ for each $t$, and $\alpha_t \equiv z_0$. □


4.8.2 Lyapunov functions 


The survey contained in Section 2.4.5 tells us much of what we need to know about Lyapunov functions. Given the goals of algorithm design, our interest is global asymptotic stability, so that the drift condition of interest is (2.38) with $x^e$ replaced by $\theta^*$: 

$$\langle\nabla V(\theta), f(\theta)\rangle < 0, \qquad \theta \ne \theta^*$$

This (and a few additional assumptions) allows application of Prop. 2.5 to establish convergence of $\vartheta$. 

Often the first step to establishing consistency of an ODE or algorithm is to show that the estimates do not “blow up”. The ODE is called ultimately bounded if there is a bounded set $S\subset\mathbb{R}^d$ such that for each initial condition $\vartheta_0$, there is a time $T(\vartheta_0)$ such that $\vartheta_t\in S$ for $t \ge T(\vartheta_0)$. This concept appeared in (4.83) in our treatment of QSA. 

There is naturally a Lyapunov condition to check: 

$$\langle\nabla V(\theta), f(\theta)\rangle \le -\delta_0, \qquad \theta\in S^c \tag{4.109}$$

Proposition 4.20. Assume that there is a continuously differentiable function $V\colon\mathbb{R}^d\to\mathbb{R}_+$ satisfying (4.109) for some $\delta_0 > 0$ and set $S\subset\mathbb{R}^d$. Then, $T_S(\theta) \le \delta_0^{-1}V(\theta)$ for $\theta\in\mathbb{R}^d$, where 

$$T_S(\theta) = \min\{t \ge 0 : \vartheta_t\in S\}, \qquad \vartheta_0 = \theta\in\mathbb{R}^d$$

If in addition $S$ is compact, and $V$ is inf-compact, then the ODE (4.2) is ultimately bounded. 


Proof. We take $\delta_0 = 1$ without loss of generality (obtained by scaling the function $V$ if necessary). 
The bound on the first entrance time $T_S$ is part of Exercise 2.12! It follows easily from the sample path interpretation of (4.109): 

$$\tfrac{d}{dt}V(\vartheta_t) \le -1, \qquad 0 \le t < T_S(\theta),\ \vartheta_0 = \theta\in\mathbb{R}^d \tag{4.110}$$

Integrate each side from time $t = 0$ to $t = T_N = \min(N, T_S(\theta))$ (the minimum with $N$ is required since we don’t yet know if $T_S(\theta) < \infty$). Next, apply the fundamental theorem of calculus: 

$$-V(\vartheta_0) \le V(\vartheta_{T_N}) - V(\vartheta_0) \le -T_N$$

giving $\min(N, T_S(\theta)) \le V(\vartheta_0)$, and the desired bound on choosing $N > V(\vartheta_0)$. 
The crucial part of the proposition requires that we modify the set $S$. Since it is compact, and $V$ is inf-compact, there exists $N < \infty$ such that $S\subset S_V(N) = \{\theta : V(\theta) \le N\}$, with $S_V(N)$ also compact. Hence, 

$$\langle\nabla V(\theta), f(\theta)\rangle \le -1, \qquad \theta\in\mathbb{R}^d,\ V(\theta) \ge N$$

In fact, we should write $V(\theta) > N$, since this corresponds to $\theta\in S_V(N)^c$, but remember the left hand side is continuous. Because $V(\vartheta_t)$ is decreasing whenever $\vartheta_t\in S_V(N)^c$, it follows that the set $S_V(N)$ is absorbing, which means that $\vartheta_t\in S_V(N)$ for all $t \ge T_S(\theta)$. □
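The entrance-time bound is easy to check on a toy example. The sketch below assumes the scalar ODE $\tfrac{d}{dt}\vartheta = -\vartheta$, the set $S = [-1, 1]$, and $V(\theta) = \theta^2/2$, so that $\langle\nabla V, f\rangle = -\theta^2 \le -1$ outside of $S$ and the bound reads $T_S(\theta) \le V(\theta)$. These choices, and the Euler step, are illustrative assumptions.

```python
import math

def entrance_time(theta0, h=1e-4, t_max=100.0):
    """First time the Euler path of d/dt theta = -theta enters S = [-1, 1]."""
    theta, t = theta0, 0.0
    while abs(theta) > 1.0 and t < t_max:
        theta += h * (-theta)
        t += h
    return t

theta0 = 5.0
V0 = 0.5 * theta0 ** 2          # V(theta) = theta^2 / 2
T_S = entrance_time(theta0)
print(T_S, math.log(theta0), V0)
```

For this example the exact entrance time is $\log|\theta_0|$, so the Lyapunov bound is valid but far from tight, which is typical of such crude estimates.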


4.8.3 Gradient flows 

The proofs of Props. 4.11 and 4.13 rely on Lyapunov techniques and the following: 

Lemma 4.21. (Arzelà-Ascoli Theorem) Consider a sequence of vector-valued functions $\{\gamma^n : n \ge 0\}$ that satisfy these two conditions on a bounded time interval $[a, b]$: 

(i) Uniformly bounded: there is a constant $B$ such that $\|\gamma^n_t\| \le B$ for all $n$ and $a \le t \le b$. 

(ii) Equicontinuity: for each $\varepsilon > 0$ there exists $\delta > 0$ such that $\|\gamma^n_t - \gamma^n_s\| \le \varepsilon$ for every $n$, and every $t, s\in[a,b]$ satisfying $|t - s| \le \delta$. 

Then there exists a subsequence $\{n_k\}$ and a continuous function $\gamma^\infty$ such that 

$$\lim_{k\to\infty}\max_{a\le t\le b}\|\gamma^{n_k}_t - \gamma^\infty_t\| = 0$$

Equicontinuity holds under a uniform Lipschitz bound: $\|\gamma^n_t - \gamma^n_s\| \le L|t - s|$ for fixed $L$, and all $t, s, n$. 


Proof of Prop. 4.11. We first note that the existence of an optimizer follows from the assumptions: the optimizer $\theta^*$ for the primal exists because of the assumptions on $\Gamma$: it remains coercive and convex when its domain is restricted to the set $\{\theta : g(\theta) = 0\}$. The dual optimizer $\lambda^*$ is then obtained via the first-order conditions for optimality:

$$0 = \nabla_\theta \mathcal{L}(\theta^*, \lambda^*) = \nabla\Gamma(\theta^*) + D^\top \lambda^*$$

Multiplying each side by $D$ and inverting gives $\lambda^* = -[DD^\top]^{-1} D \nabla\Gamma(\theta^*)$, and then by construction
$$\varphi^*(\lambda^*) = \min_\theta \mathcal{L}(\theta, \lambda^*) = \mathcal{L}(\theta^*, \lambda^*) = \Gamma(\theta^*)$$


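The proof of convergence is similar to Prop. 4.7. Consider the Lyapunov function

$$V(\theta, \lambda) = \tfrac12 \|\theta - \theta^*\|^2 + \tfrac12 \|\lambda - \lambda^*\|^2 \tag{4.111}$$

Applying the chain rule,

$$\begin{aligned} \tfrac{d}{dt} V(\vartheta_t, \lambda_t) &= \langle \vartheta_t - \theta^*, \tfrac{d}{dt}\vartheta_t \rangle + \langle \lambda_t - \lambda^*, \tfrac{d}{dt}\lambda_t \rangle \\ &= -\langle \vartheta_t - \theta^*, \nabla_\theta \mathcal{L}(\vartheta_t, \lambda_t) \rangle + \langle \lambda_t - \lambda^*, \nabla_\lambda \mathcal{L}(\vartheta_t, \lambda_t) \rangle \end{aligned}$$

Convexity of $\mathcal{L}$ in $\theta$ and linearity in $\lambda$ gives, respectively,

$$\begin{aligned} \mathcal{L}(\theta^*, \lambda_t) &\ge \mathcal{L}(\vartheta_t, \lambda_t) + \langle \theta^* - \vartheta_t, \nabla_\theta \mathcal{L}(\vartheta_t, \lambda_t) \rangle \\ \mathcal{L}(\vartheta_t, \lambda_t) &= \mathcal{L}(\vartheta_t, \lambda^*) + \langle \lambda_t - \lambda^*, \nabla_\lambda \mathcal{L}(\vartheta_t, \lambda_t) \rangle \end{aligned}$$

This implies the derivative bound

$$\begin{aligned} \tfrac{d}{dt} V(\vartheta_t, \lambda_t) &\le [\mathcal{L}(\theta^*, \lambda_t) - \mathcal{L}(\vartheta_t, \lambda_t)] + [\mathcal{L}(\vartheta_t, \lambda_t) - \mathcal{L}(\vartheta_t, \lambda^*)] \\ &= [\mathcal{L}(\theta^*, \lambda_t) - \mathcal{L}(\theta^*, \lambda^*)] + [\mathcal{L}(\theta^*, \lambda^*) - \mathcal{L}(\vartheta_t, \lambda^*)] \end{aligned}$$

The first term on the right hand side can be simplified:

$$\mathcal{L}(\theta^*, \lambda_t) - \mathcal{L}(\theta^*, \lambda^*) = (\lambda_t - \lambda^*)^\top g(\theta^*)$$

This is zero since $\theta^*$ is feasible. The second term is non-positive by the saddle point property (4.34), giving

$$\tfrac{d}{dt} V(\vartheta_t, \lambda_t) \le \mathcal{L}(\theta^*, \lambda^*) - \mathcal{L}(\vartheta_t, \lambda^*) \le 0$$

The inequality is strict when $\vartheta_t \ne \theta^*$ since $\Gamma$ and hence $\mathcal{L}(\cdot, \lambda^*)$ is strictly convex.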
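From this we obtain the following conclusions:

(i) $V(\vartheta_t, \lambda_t)$ is non-increasing, which implies that $\{\vartheta_t, \lambda_t\}$ evolve in a compact set.

(ii) The bound $0 \le \int_0^T \{\mathcal{L}(\vartheta_t, \lambda^*) - \mathcal{L}(\theta^*, \lambda^*)\}\, dt \le V(\vartheta_0, \lambda_0)$ is obtained for all $T > 0$ by integration. This and (i) imply that $\lim_{t \to \infty} \mathcal{L}(\vartheta_t, \lambda^*) = \mathcal{L}(\theta^*, \lambda^*) = \Gamma(\theta^*)$ (see Lemma 4.3).

(iii) $\lim_{t \to \infty} \vartheta_t = \theta^*$, since $\mathcal{L}(\cdot, \lambda^*)$ is strictly convex.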


To establish convergence of the dual variable we revisit (4.35a). Applying Lemma 4.21 with $\gamma^n_t = (\vartheta_{n+t}, \tfrac{d}{dt}\vartheta_{n+t})$ we can establish equicontinuity on $[a, b]$ for any $a < b$, and any sub-sequential limit $\gamma^\infty$ is identically constant: $\gamma^\infty_t = (\theta^*, 0)$. Hence from the derivative equation (4.35a),

$$0 = \lim_{t \to \infty} \tfrac{d}{dt}\vartheta_t = \lim_{t \to \infty} \{-\nabla\Gamma(\vartheta_t) - D^\top \lambda_t\}$$

It follows that any limit point $\lambda_\infty$ of $\{\lambda_t\}$ satisfies $0 = \nabla\Gamma(\theta^*) + D^\top \lambda_\infty$, and, on multiplying each side by $D$,

$$0 = D\nabla\Gamma(\theta^*) + DD^\top \lambda_\infty$$

Under the full rank assumption, $\lambda_\infty = -[DD^\top]^{-1} D\nabla\Gamma(\theta^*) = \lambda^*$. □
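The convergence argument for Prop. 4.11 can be checked numerically on a small example. The sketch below is an illustration only: it assumes the quadratic objective $\Gamma(\theta) = \tfrac12\|\theta - c\|^2$ and affine constraint $g(\theta) = D\theta - d$ (neither is taken from the text), and replaces the primal-dual flow by a simple Euler discretization. The names `primal_dual_flow`, `c`, `D`, `d` are hypothetical.

```python
import numpy as np

def primal_dual_flow(c, D, d, dt=0.01, T=60.0):
    """Euler discretization of the primal-dual flow for the assumed example
    Gamma(theta) = 0.5*||theta - c||^2, g(theta) = D theta - d."""
    theta = np.zeros(D.shape[1])
    lam = np.zeros(D.shape[0])
    for _ in range(int(T / dt)):
        grad_theta = (theta - c) + D.T @ lam   # grad_theta L(theta, lam)
        grad_lam = D @ theta - d               # grad_lam L(theta, lam) = g(theta)
        theta = theta - dt * grad_theta        # descent in the primal variable
        lam = lam + dt * grad_lam              # ascent in the dual variable
    return theta, lam

c = np.array([1.0, -2.0, 0.5])
D = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])                # full row rank
d = np.array([1.0, 0.0])

theta_T, lam_T = primal_dual_flow(c, D, d)

# Closed-form optimizers for this quadratic example
lam_star = np.linalg.solve(D @ D.T, D @ c - d)
theta_star = c - D.T @ lam_star
# Representation used in the proof: lam* = -[D D^T]^{-1} D grad Gamma(theta*)
lam_from_formula = -np.linalg.solve(D @ D.T, D @ (theta_star - c))
```

For this example both the primal and the dual trajectories settle at the closed-form optimizers, and the multiplier agrees with the representation $\lambda^* = -[DD^\top]^{-1}D\nabla\Gamma(\theta^*)$.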


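Proof of Prop. 4.13. We present a shorter proof, highlighting the differences with the proof of Prop. 4.11.
We first obtain a representation for the Lagrange multiplier: from the first-order criterion for optimality, we have $\nabla_\theta \mathcal{L}(\theta, \lambda) = 0$ at $(\theta, \lambda) = (\theta^*, \lambda^*)$. That is,

$$\nabla\Gamma(\theta^*) + D^\top \lambda^* = 0, \qquad \text{with } D = \partial_\theta g(\theta^*),$$

and thus by the full rank assumption,

$$\lambda^* = -[DD^\top]^{-1} D\nabla\Gamma(\theta^*)$$

Based on this representation, the proposition is established once we have shown that $\{\vartheta_t, \lambda_t\}$ is bounded, and $\lim_{t \to \infty} \vartheta_t = \theta^*$. As in the previous proposition this then implies that, with $D_t = \partial_\theta g(\vartheta_t)$,

$$0 = \lim_{t \to \infty} \tfrac{d}{dt}\vartheta_t = \lim_{t \to \infty} \{-\nabla\Gamma(\vartheta_t) - D_t^\top \lambda_t\}$$

Under convergence of $\{\vartheta_t\}$ and the full rank condition, this implies that $\{\lambda_t\}$ is also convergent:

$$\lim_{t \to \infty} \lambda_t = -\lim_{t \to \infty} \{D_t D_t^\top\}^{-1} D_t \nabla\Gamma(\vartheta_t) = \lambda^*$$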


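To complete the proof, we use the quadratic Lyapunov function (4.111) to establish convergence of $\{\vartheta_t\}$. Apply the chain rule to obtain

$$V(\vartheta_T, \lambda_T) = V(\vartheta_0, \lambda_0) + \int_0^T \{-\langle \vartheta_t - \theta^*, \nabla_\theta \mathcal{L}(\vartheta_t, \lambda_t) \rangle + \langle \lambda_t - \lambda^*, \nabla_\lambda \mathcal{L}(\vartheta_t, \lambda_t) \rangle\}\, dt + \int_0^T (\lambda_t - \lambda^*)^\top d\gamma_t$$

The final integral can be bounded as follows:

$$\int_0^T (\lambda_t - \lambda^*)^\top d\gamma_t = -\lambda^{*\top} \int_0^T d\gamma_t \le 0$$

The equality follows from (4.39), and the inequality follows from the assumption that the integral and $\lambda^*$ are each $n$-dimensional vectors with non-negative entries.
Proceeding as in the case of equality constraints,

$$\begin{aligned} V(\vartheta_T, \lambda_T) - V(\vartheta_0, \lambda_0) &\le \int_0^T \{[\mathcal{L}(\theta^*, \lambda_t) - \mathcal{L}(\theta^*, \lambda^*)] + [\mathcal{L}(\theta^*, \lambda^*) - \mathcal{L}(\vartheta_t, \lambda^*)]\}\, dt \\ &= \int_0^T \{\lambda_t^\top g(\theta^*) + [\mathcal{L}(\theta^*, \lambda^*) - \mathcal{L}(\vartheta_t, \lambda^*)]\}\, dt \end{aligned}$$

This implies that $V(\vartheta_T, \lambda_T)$ is non-increasing, since each term in the integrand is non-positive:

$$\lambda_t^\top g(\theta^*) \le 0 \quad \text{and} \quad \mathcal{L}(\theta^*, \lambda^*) - \mathcal{L}(\vartheta_t, \lambda^*) \le 0$$

Moreover, the second inequality is strict whenever $\vartheta_t \ne \theta^*$. From here we may follow the steps of Prop. 4.11 to conclude that $\vartheta_t$ is convergent to $\theta^*$. □

4.8.4 The ODE at ∞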


We consider here an entirely different way to verify that the ODE (4.2) is ultimately bounded. 

The idea is pretty simple: to see if the ODE is ultimately bounded, we only need to consider values of $\vartheta$ that are "very big". Rather than bring out a telescope to examine these big states, we scale the state, and examine the resulting dynamics. To make this explicit requires that we make dependency on the initial condition explicit, writing $\vartheta(t; \theta_0)$ for the solution to (4.2) with initial condition $\vartheta_0 = \theta_0$.

Let $r \ge 1$ be a scaling parameter (assumed large), consider the solution of the ODE with $\vartheta_0 = r\theta_0$, and scale the solution to obtain

$$\vartheta^r_t := r^{-1} \vartheta(t; r\theta_0)$$

We have $\vartheta^r_0 = \theta_0$ for any $r \ge 1$, and we obtain from (4.2),

$$\tfrac{d}{dt}\vartheta^r_t = r^{-1} \tfrac{d}{dt}\vartheta(t; r\theta_0) = r^{-1} f(\vartheta(t; r\theta_0))$$

On denoting $f_r(\theta) = r^{-1} f(r\theta)$ for $\theta \in \mathbb{R}^d$, this becomes

$$\tfrac{d}{dt}\vartheta^r_t = f_r(\vartheta^r_t) \tag{4.112}$$

Suppose that a limiting vector field exists:

$$f_\infty(\theta) := \lim_{r \to \infty} f_r(\theta) = \lim_{r \to \infty} r^{-1} f(r\theta), \qquad \theta \in \mathbb{R}^d, \tag{4.113}$$

and define the ODE at ∞ as the limiting case of (4.112):

$$\tfrac{d}{dt}\vartheta^\infty_t = f_\infty(\vartheta^\infty_t), \qquad \vartheta^\infty_0 \in \mathbb{R}^d \tag{4.114}$$

We have $f_\infty(0) = 0$ by (4.113), so the origin is an equilibrium of (4.114).
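The scaling in (4.112) and (4.113) is easy to examine numerically. The sketch below uses a hypothetical vector field (a linear term plus bounded terms, chosen only for illustration and not taken from the text) for which the limit is simply the linear part; the names `A`, `f`, `f_r`, `f_inf` are assumptions of this example.

```python
import numpy as np

A = np.array([[-1.0, 2.0],
              [0.0, -3.0]])

def f(theta):
    # Hypothetical vector field: linear part plus bounded perturbations
    return A @ theta + np.tanh(theta) + 1.0

def f_r(theta, r):
    # Scaled vector field f_r(theta) = r^{-1} f(r theta), as in (4.112)
    return f(r * theta) / r

def f_inf(theta):
    # Candidate limit (4.113): the bounded terms vanish after division by r
    return A @ theta

theta = np.array([0.7, -1.3])
scales = (1.0, 10.0, 100.0, 1000.0)
gaps = [np.linalg.norm(f_r(theta, r) - f_inf(theta)) for r in scales]
```

In this example the gap $\|f_r(\theta) - f_\infty(\theta)\|$ is at most $2\sqrt{2}/r$, and the limit is radially homogeneous, $f_\infty(s\theta) = s f_\infty(\theta)$, in agreement with Lemma 4.23 (i).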


Proposition 4.22. Suppose that $f$ is globally Lipschitz continuous, with Lipschitz constant $L$. Suppose that the limit (4.113) exists for all $\theta$ to define a continuous function $f_\infty : \mathbb{R}^d \to \mathbb{R}^d$. Then, if the origin is asymptotically stable for (4.114), it follows that the ODE (4.2) is ultimately bounded.


To prove the proposition we first need to better understand the special properties of the solution 
to (4.114): 


Lemma 4.23. Suppose that the assumptions of Prop. 4.22 hold, so in particular the origin is asymptotically stable for (4.114). Then the following hold:

(i) For each $\theta \in \mathbb{R}^d$ and $s \ge 0$,
$$f_\infty(s\theta) = s f_\infty(\theta)$$

(ii) If $\{\vartheta^\infty_t : t \ge 0\}$ is any solution to the ODE (4.114), and $s > 0$, then $\{y_t = s\vartheta^\infty_t : t \ge 0\}$ is also a solution, starting from $y_0 = s\vartheta^\infty_0 \in \mathbb{R}^d$.

(iii) The origin is globally asymptotically stable for (4.114), and convergence to the origin is exponentially fast: for some $R < \infty$ and $\rho > 0$,

$$\|\vartheta^\infty_t\| \le R e^{-\rho t} \|\vartheta^\infty_0\|, \qquad \vartheta^\infty_0 \in \mathbb{R}^d$$

Proof. Consider first the scaling result in part (i): from the definition (4.113), with $s > 0$,

$$f_\infty(s\theta) = \lim_{r \to \infty} r^{-1} f(rs\theta) = s \lim_{r \to \infty} (sr)^{-1} f(rs\theta) = s f_\infty(\theta)$$


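The case $s = 0$ is trivial, since it is clear that $f_\infty(0) = 0$. This establishes (i).
Next, write

$$\vartheta^\infty_t = \vartheta^\infty_0 + \int_0^t f_\infty(\vartheta^\infty_\tau)\, d\tau$$

Multiplying both sides by $s$ and applying (i) gives (ii).
Asymptotic stability of the origin implies the following: (i) there exists $\varepsilon > 0$ such that $\lim_{t \to \infty} \vartheta^\infty_t = 0$ whenever $\|\vartheta^\infty_0\| \le \varepsilon$, and (ii) the convergence is uniform in the initial condition. Consequently, there exists $T_0 > 0$ such that

$$\|\vartheta^\infty_t\| \le \tfrac12 \varepsilon \quad \text{for } t \ge T_0 \text{ whenever } \|\vartheta^\infty_0\| \le \varepsilon$$

Next, apply scaling: for any initial condition $\vartheta^\infty_0$, consider $y_t = s\vartheta^\infty_t$ using $s = \varepsilon / \|\vartheta^\infty_0\|$, chosen so that $\|y_0\| = \varepsilon$. Then $\|y_t\| \le \tfrac12 \varepsilon = \tfrac12 \|y_0\|$ for $t \ge T_0$, implying

$$\|\vartheta^\infty_t\| \le \tfrac12 \|\vartheta^\infty_0\|, \qquad t \ge T_0, \quad \vartheta^\infty_0 \in \mathbb{R}^d$$

This easily implies (iii) by iteration, as follows: for any $t$ we can write $t = nT_0 + t_0$, with $0 \le t_0 < T_0$, so that

$$\|\vartheta^\infty_t\| \le \tfrac12 \|\vartheta^\infty_{(n-1)T_0 + t_0}\| \le 2^{-n} \|\vartheta^\infty_{t_0}\|$$

Prop. 4.19 gives $\|\vartheta^\infty_{t_0}\| \le e^{LT_0} \|\vartheta^\infty_0\|$, so that

$$\|\vartheta^\infty_t\| \le 2e^{LT_0}\, 2^{-(n+1)} \|\vartheta^\infty_0\|$$

where the right hand side has been arranged to make use of the bound $t \le (n+1)T_0$, giving $2^{-(n+1)} \le \exp(-\log(2)\, t / T_0)$. We arrive at the bound in (iii) with $R = 2e^{LT_0}$ and $\rho = \log(2)/T_0$. □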


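Proof of Prop. 4.22. Denote

$$\mathcal{E}(\theta) = \|f(\theta) - f_\infty(\theta)\|$$

so that by Lemma 4.23, with $s = \|\theta\|$,

$$\frac{1}{\|\theta\|} \mathcal{E}(\theta) = \|f_s(\theta/s) - f_\infty(\theta/s)\|$$

Because the functions $\{f_s : s \ge 1\}$ are uniformly Lipschitz continuous, and $\|\theta/s\| = 1$ by definition, the right hand side converges to zero uniformly as $\|\theta\| \to \infty$. That is, $\mathcal{E}(\theta) = o(\|\theta\|)$.

Let's think about what this means: for any $\varepsilon > 0$, there exists $N(\varepsilon) < \infty$, such that $\mathcal{E}(\theta) \le \varepsilon\|\theta\|$ whenever $\|\theta\| \ge N(\varepsilon)$. From this we get the simpler looking bound:

$$\mathcal{E}(\theta) \le B_\varepsilon + \varepsilon\|\theta\|, \qquad \text{where } B_\varepsilon = \max\{\mathcal{E}(\theta) : \|\theta\| \le N(\varepsilon)\} \tag{4.115}$$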


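For any initial condition $\vartheta_0$ we compare two solutions based on this bound:

$$\vartheta_t = \vartheta_0 + \int_0^t f(\vartheta_\tau)\, d\tau$$

$$\vartheta^\infty_t = \vartheta_0 + \int_0^t f_\infty(\vartheta^\infty_\tau)\, d\tau$$

Write $z_t = \|\vartheta_t - \vartheta^\infty_t\|$ and use the preceding definition to obtain,

$$z_t \le \int_0^t \|f_\infty(\vartheta_\tau) - f_\infty(\vartheta^\infty_\tau)\|\, d\tau + \int_0^t \mathcal{E}(\vartheta_\tau)\, d\tau \le L\int_0^t z_\tau\, d\tau + \int_0^t \mathcal{E}(\vartheta_\tau)\, d\tau$$

Grönwall's Inequality in its second form (4.6c) holds, with $\beta_t = L$, and $\alpha_t$ the second integral, giving

$$z_t \le e^{Lt} \int_0^t \mathcal{E}(\vartheta_\tau)\, d\tau \le e^{Lt} \int_0^t \{B_\varepsilon + \varepsilon\|\vartheta_\tau\|\}\, d\tau$$

where the second inequality uses (4.115), with $\varepsilon > 0$ to be chosen. Prop. 4.19 gives $\|\vartheta_\tau\| \le \{B_f + \|\vartheta_0\|\} e^{L\tau}$, so that

$$\|\vartheta_t - \vartheta^\infty_t\| = z_t \le t e^{Lt} B_\varepsilon + \varepsilon e^{Lt} \{B_f + \|\vartheta_0\|\} \{L^{-1} e^{Lt}\}$$

And applying the triangle inequality once more,

$$\|\vartheta_t\| \le \|\vartheta^\infty_t\| + \varepsilon L^{-1} e^{2Lt} \|\vartheta_0\| + B_{\varepsilon,t}$$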


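where the value $B_{\varepsilon,t}$ can be identified by rearranging terms. Finally we can bring in Lemma 4.23, which implies the existence of $T_0$ such that $\|\vartheta^\infty_t\| \le \tfrac12 \|\vartheta^\infty_0\| = \tfrac12 \|\vartheta_0\|$ when $t \ge T_0$. Hence,

$$\|\vartheta_{T_0}\| \le \bigl(\tfrac12 + \varepsilon L^{-1} e^{2LT_0}\bigr) \|\vartheta_0\| + B_{\varepsilon,T_0}$$

Choose $\varepsilon > 0$ so small that the term in parentheses is no greater than 3/4:

$$\|\vartheta_{T_0}\| \le \rho\|\vartheta_0\| + B_{\varepsilon,T_0}, \qquad \rho = 3/4$$

Arguing as in the proof of Lemma 4.23, we can iterate to obtain for each integer $n$, and $t_0 < T_0$,

$$\|\vartheta_{nT_0 + t_0}\| \le \rho^n \|\vartheta_{t_0}\| + \frac{1}{1-\rho} B_{\varepsilon,T_0} \le \rho^n \{B_f + \|\vartheta_0\|\} e^{LT_0} + \frac{1}{1-\rho} B_{\varepsilon,T_0}$$

This establishes ultimate boundedness, and we can choose $S = \{\theta : \|\theta\| \le \frac{1}{1-\rho} B_{\varepsilon,T_0} + 1\}$. □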
—p 


4.9 Convergence Theory for QSA* 


We consider in this section the general nonlinear ODE (4.44), subject to Assumptions (QSA1)-(QSA3) (introduced in Section 4.5.4). The proof of Thm. 4.15 is mainly a straightforward application of the Grönwall Inequality, as stated in Prop. 4.2. Stability theory for QSA follows closely the stability theory of ODEs surveyed in Section 4.8.

It is only when we come to convergence rates that we encounter mathematical challenges. These 
results also require further assumptions. First, a slight strengthening of (QSA2): 


(QSA4) The vector field $\bar f$ is differentiable, with derivative denoted

$$A(\theta) = \partial_\theta \bar f(\theta) \tag{4.116}$$

That is, $A(\theta)$ is a $d \times d$ matrix for each $\theta \in \mathbb{R}^d$, with $A_{i,j}(\theta) = \frac{\partial}{\partial\theta_j} \bar f_i(\theta)$.

Moreover, the derivative $A$ is Lipschitz continuous, and $A^* = A(\theta^*)$ is Hurwitz.

The matrix-valued function $A$ is uniformly bounded over $\mathbb{R}^d$, subject to the global Lipschitz assumption on $\bar f$ imposed in (QSA2). The Hurwitz assumption implies that the ODE (4.42) is (locally) exponentially asymptotically stable.
The final assumption is a substantial strengthening of the ergodic limit (4.82): 


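(QSA5) The probing signal is the solution to (4.79), with $\Omega$ a compact subset of Euclidean space. It has a unique invariant measure $\mu$ on $\Omega$, and satisfies the following for each initial condition $\xi_0 \in \Omega$:

(i) For each $\theta$ there exists a function $\hat f(\theta, \cdot)$ satisfying

$$\hat f(\theta, \xi_{t_0}) = \int_{t_0}^{t_1} [f(\theta, \xi_t) - \bar f(\theta)]\, dt + \hat f(\theta, \xi_{t_1}), \qquad \text{for all } 0 \le t_0 \le t_1 \tag{4.117}$$

with $\bar f(\theta) = \int_\Omega f(\theta, x)\, \mu(dx)$ and $0 = \int_\Omega \hat f(\theta, x)\, \mu(dx)$.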


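(ii) The function $\hat f$, and derivatives $\partial_\theta \bar f$ and $\partial_\theta f$, are $C^1$ and Lipschitz continuous in $\theta$. In particular, $\hat f$ admits a derivative $\hat A$ satisfying

$$\hat A(\theta, \xi_{t_0}) = \int_{t_0}^{t_1} [A(\theta, \xi_t) - A(\theta)]\, dt + \hat A(\theta, \xi_{t_1}), \qquad 0 \le t_0 \le t_1$$

where $A(\theta, \xi) = \partial_\theta f(\theta, \xi)$ and $A(\theta) = \partial_\theta \bar f(\theta)$ was defined in (4.116). Lipschitz continuity is assumed uniform with respect to the exploration process: for $L_f < \infty$,

$$\begin{aligned} \|\hat f(\theta', \xi) - \hat f(\theta, \xi)\| &\le L_f \|\theta' - \theta\| \\ \|A(\theta', \xi) - A(\theta, \xi)\| &\le L_f \|\theta' - \theta\| \\ \|\hat A(\theta', \xi) - \hat A(\theta, \xi)\| &\le L_f \|\theta' - \theta\|, \qquad \theta, \theta' \in \mathbb{R}^d, \ \xi \in \Omega \end{aligned}$$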


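(iii) Denote $\Upsilon_t = [\hat A(\theta^*, \xi_0) - \hat A(\theta^*, \xi_t)] f(\theta^*, \xi_t)$. The following limit exists:

$$\bar\Upsilon := \lim_{T \to \infty} \frac{1}{T} \int_0^T \Upsilon_t\, dt = -\int_\Omega \hat A(\theta^*, x) f(\theta^*, x)\, \mu(dx)$$

and the following partial integral is bounded in $t$:

$$\int_0^t \tilde\Upsilon_\tau\, d\tau, \qquad \text{where } \tilde\Upsilon_t := \Upsilon_t - \bar\Upsilon \tag{4.118}$$

Assumption (QSA5) (iii) is imposed because the vector $\bar\Upsilon$ arises in an approximation of the scaled error $Z_t$. This assumption is not much stronger than the others. In particular, the partial integral in (4.118) will be bounded if there is a bounded solution $\{\hat\Upsilon_t\}$ to

$$\hat\Upsilon_{t_0} = \int_{t_0}^{t_1} \tilde\Upsilon_t\, dt + \hat\Upsilon_{t_1}, \qquad \text{for all } 0 \le t_0 \le t_1$$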
The remainder of this section is organized into five subsections:

▲ The first subsection summarizes bounds on the rate of convergence for QSA, found in Thms. 4.24 and 4.25, along with an overview of their proofs.

▲ Section 4.9.2 concerns bounds between (4.42) and the QSA ODE (4.44), which is the foundation for much of the remainder of this section.

▲ Criteria for ultimate boundedness for QSA are contained in Section 4.9.3.

▲ Section 4.9.4 contains theory to justify Assumption (QSA5), and

▲ Section 4.9.5 contains proofs of the main results related to rates of convergence.


4.9.1 Main results and some insights 


The notation $\hat f$ in (4.117) is used to emphasize the parallels with Markov process and stochastic approximation theory: this is precisely the solution to Poisson's equation (with forcing function $\tilde f(\cdot) = f(\theta, \cdot) - \bar f(\theta)$) that appears in theory of simulation of Markov processes, average-cost optimal control, and stochastic approximation [144, 12, 250, 39]. For the one-dimensional probing signal defined by the sawtooth function $\xi_t := t \pmod 1$, $t \ge 0$, a solution to Poisson's equation has a simple form:

$$\hat g(z) = -\int_0^z [g(x) - \bar g]\, dx + \hat g(0), \qquad z \in [0, 1)$$

where $\bar g = \int_0^1 g(x)\, dx$.
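The sawtooth formula can be verified directly. The sketch below is an illustration under an assumed forcing function $g(x) = x^2$ (not taken from the text), for which $\bar g = 1/3$; the constant $\hat g(0) = -1/12$ is chosen here so that $\hat g$ has zero mean on $[0,1)$. The function `poisson_residual` (a hypothetical helper) evaluates the difference between the two sides of Poisson's equation over an interval $[t_0, t_1]$ using a midpoint rule.

```python
import numpy as np

g_bar = 1.0 / 3.0                      # mean of the assumed g(x) = x^2 over [0, 1)

def g(x):
    return x ** 2

def g_hat(z):
    # g_hat(z) = -int_0^z [g(x) - g_bar] dx + g_hat(0), with g_hat(0) = -1/12
    # chosen so that g_hat has zero mean on [0, 1)
    return -(z ** 3 / 3.0 - g_bar * z) - 1.0 / 12.0

def xi(t):
    return np.mod(t, 1.0)              # sawtooth probing signal

def poisson_residual(t0, t1, n=200_000):
    # Midpoint rule for int_{t0}^{t1} [g(xi_t) - g_bar] dt
    dt = (t1 - t0) / n
    mid = t0 + (np.arange(n) + 0.5) * dt
    integral = np.sum(g(xi(mid)) - g_bar) * dt
    # Residual of g_hat(xi_{t0}) = integral + g_hat(xi_{t1})
    return g_hat(xi(t0)) - (integral + g_hat(xi(t1)))

res = poisson_residual(0.3, 2.75)
z = (np.arange(1000) + 0.5) / 1000.0
mean_g_hat = np.mean(g_hat(z))
```

The residual is zero up to quadrature error, including across the jumps of the sawtooth, because $\int_0^1 [g(x) - \bar g]\,dx = 0$ makes $\hat g$ continuous when extended periodically.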


It will be useful to introduce new notation: for $\theta \in \mathbb{R}^d$ and $T \ge 0$,

$$\Xi_T(\theta) = \int_0^T [f(\theta, \xi_t) - \bar f(\theta)]\, dt = \hat f(\theta, \xi_0) - \hat f(\theta, \xi_T) \tag{4.119}$$

where the second equality follows from (4.117). The special case $\theta = \theta^*$ deserves special notation:

$$\Xi^*_T = \Xi_T(\theta^*) = \int_0^T f(\theta^*, \xi_t)\, dt$$

where the right hand side is justified by the equilibrium condition $\bar f(\theta^*) = 0$.


Theorem 4.24. Suppose that (QSA1)-(QSA5) hold, and the gain is $a_t = 1/(1+t)^\rho$.
(i) $\rho < 1$. The following hold:

$$\Theta_t = \theta^* + a_t Z_t + o(a_t), \qquad Z_t = \bar Y + \Xi^*_t + o(1) \tag{4.120a}$$

where

$$\bar Y = [A^*]^{-1} \bar\Upsilon \tag{4.120b}$$

(ii) $\rho = 1$. If $I + A^*$ is Hurwitz, then the convergence rate is $1/t$:

$$\Theta_t = \theta^* + a_t Z_t + o(a_t), \qquad Z_t = \bar Y + \Xi^*_t + o(1), \quad \text{with } \bar Y = [I + A^*]^{-1} \bar\Upsilon \tag{4.120c}$$

□

We turn next to PJR averaging (4.78). This is often presented as a two time-scale algorithm:


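$$\tfrac{d}{dt}\Theta_t = a_t f(\Theta_t, \xi_t), \tag{4.121a}$$

$$\tfrac{d}{dt}\Theta^{\text{PR}}_t = \frac{1}{1+t} \bigl[\Theta_t - \Theta^{\text{PR}}_t\bigr] \tag{4.121b}$$

What is crucial in this estimation technique is that the first gain is relatively large: $\lim_{t \to \infty} (1+t) a_t = \infty$. The solution to (4.121b) can be expressed as an approximate average:

$$\Theta^{\text{PR}}_T = \frac{1}{1+T} \Bigl[\Theta^{\text{PR}}_0 + \int_0^T \Theta_t\, dt\Bigr]$$

It may not make sense to average over the entire history, since this may include wild initial transients. This is why (4.78) is usually preferred, which is restated here: for $\kappa > 1$,

$$\Theta^{\text{PR}}_T = \frac{1}{T - T_0} \int_{T_0}^T \Theta_t\, dt, \qquad T > 0, \quad T_0 = T - T/\kappa \tag{4.121c}$$

The choice of $T_0$ is made so that $1/(T - T_0) = \kappa/T$. The set of equations (4.121) will be called Polyak-Ruppert averaging, but the theory that follows is restricted to (4.121c).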
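A minimal sketch of the window average (4.121c), applied to a simulated trajectory. Everything specific here is an assumption made for illustration: the scalar vector field $f(\theta, \xi_t) = -(\theta - \theta^*) + \sin(t)$, the Euler discretization, and the helper names `simulate_qsa` and `pr_average`. The example is not meant to reproduce any result from the text.

```python
import numpy as np

def simulate_qsa(theta_star=2.0, rho=0.7, T=200.0, dt=0.01, theta0=0.0):
    # Euler scheme for d/dt Theta_t = a_t f(Theta_t, xi_t), a_t = (1+t)^(-rho),
    # with the hypothetical choice f(theta, xi_t) = -(theta - theta_star) + sin(t)
    n = int(round(T / dt))
    t = np.arange(n + 1) * dt
    theta = np.empty(n + 1)
    theta[0] = theta0
    for k in range(n):
        a = (1.0 + t[k]) ** (-rho)
        theta[k + 1] = theta[k] + dt * a * (-(theta[k] - theta_star) + np.sin(t[k]))
    return t, theta

def pr_average(t, theta, kappa=2.0):
    # Window average (4.121c): mean of Theta over [T0, T], T0 = T - T/kappa
    T = t[-1]
    T0 = T - T / kappa
    mask = t >= T0
    tw, xw = t[mask], theta[mask]
    integral = np.sum(0.5 * (xw[1:] + xw[:-1]) * np.diff(tw))  # trapezoid rule
    return integral / (tw[-1] - tw[0])

t, theta = simulate_qsa()
theta_pr = pr_average(t, theta)
raw_err = np.max(np.abs(theta[t >= t[-1] - 10.0] - 2.0))  # peak error near the end
pr_err = abs(theta_pr - 2.0)
```

In this particular example the raw estimate oscillates around $\theta^*$ with amplitude of order $a_T$, and the window average removes most of that oscillation. This should not be read as a general statement: as noted after Thm. 4.25, averaging can also slow convergence.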


Theorem 4.25. Suppose that the assumptions of Thm. 4.24 hold, so that in particular $a_t = 1/(1+t)^\rho$ with $\rho \in (\tfrac12, 1)$. Then, with $\Theta^{\text{PR}}_T$ defined by (4.121c), with $\kappa > 1$, we have

$$\Theta^{\text{PR}}_T = \theta^* + a_T\, c(\rho, \kappa)\, \bar Y + \Phi_T / T \tag{4.122}$$

where $\bar Y$ is defined in (4.120b), $c(\rho, \kappa) = \kappa[1 - (1 - 1/\kappa)^{1-\rho}]/(1 - \rho)$, and where $\{\Phi_T\}$ is a bounded function of time. Consequently, the convergence rate is $1/T$ if and only if $\bar Y = 0$. □

Exercise 4.11 provides a simple example for which $\bar Y \ne 0$, and averaging might actually slow convergence. Exercise 4.12 contains a more positive message: when using qSGD #1 we might have $\bar Y \ne 0$, but its norm is of order $\varepsilon$ (the scaling used in the algorithm).


Proof outline  The first step is to explain why we can replace $\bar\Theta_t$ with $\theta^*$ in the definition (4.72) of the scaled error $Z_t$. Prop. 4.26 provides justification, and makes clear the enormous difference between the choice of $\rho = 1$ or $\rho < 1$ when using the gain $a_t = 1/(1+t)^\rho$:

Proposition 4.26. Suppose that (QSA3) holds, and that $\bar f$ is $C^1$ with $A^*$ Hurwitz. Fix $\varrho_0 > 0$ satisfying $\text{Real}(\lambda) < -\varrho_0$ for every eigenvalue $\lambda$ of $A^*$. Then, there exist $b > 0$, $B < \infty$ such that whenever $\|\vartheta_{\tau_0} - \theta^*\| \le b$, the solution $\{\vartheta_\tau : \tau \ge \tau_0\}$ of the ODE (4.42) satisfies

$$\|\vartheta_\tau - \theta^*\| \le B \|\vartheta_{\tau_0} - \theta^*\| \exp(-\varrho_0 [\tau - \tau_0]), \qquad \tau \ge \tau_0$$

Consequently, the following hold for the solution to the ODE (4.71) using the gain $a_t = 1/(1+t)^\rho$: If $\|\bar\Theta_{t_0} - \theta^*\| \le b$, then

(i) If $\rho = 1$ then $\|\bar\Theta_t - \theta^*\| \le B \|\bar\Theta_{t_0} - \theta^*\| [(1 + t_0)/(1 + t)]^{\varrho_0}$ for $t \ge t_0$.

(ii) Using any $0 < \rho < 1$,

$$\|\bar\Theta_t - \theta^*\| \le B_{t_0} \|\bar\Theta_{t_0} - \theta^*\| \exp\bigl(-\varrho_0 (1 - \rho)^{-1} (1 + t)^{1-\rho}\bigr), \qquad t \ge t_0 \tag{4.123}$$

where $B_{t_0} = B \exp\bigl(\varrho_0 (1 - \rho)^{-1} (1 + t_0)^{1-\rho}\bigr)$. □
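The two cases in Prop. 4.26 differ only through the change of time scale: with $d\tau = a_t\,dt$, the time elapsed on the ODE clock between $t_0$ and $t$ is $\int_{t_0}^t (1+r)^{-\rho}\,dr$. The sketch below evaluates this integral in closed form and the resulting decay factor $\exp(-\varrho_0[\tau - \tau_0])$. The helper names and the numerical values of `varrho0` and `t0` are assumptions chosen for illustration.

```python
import math

def elapsed_ode_time(t, t0, rho):
    # int_{t0}^{t} (1 + r)^(-rho) dr: time elapsed on the ODE clock
    if rho == 1.0:
        return math.log((1.0 + t) / (1.0 + t0))
    return ((1.0 + t) ** (1.0 - rho) - (1.0 + t0) ** (1.0 - rho)) / (1.0 - rho)

def decay_factor(t, t0, rho, varrho0):
    # exp(-varrho0 * [tau - tau0]) evaluated at the elapsed ODE time
    return math.exp(-varrho0 * elapsed_ode_time(t, t0, rho))

varrho0, t0 = 0.5, 1.0
rows = [(t, decay_factor(t, t0, 1.0, varrho0), decay_factor(t, t0, 0.7, varrho0))
        for t in (10.0, 100.0, 1000.0)]
```

With $\rho = 1$ the factor reduces to the polynomial $[(1+t_0)/(1+t)]^{\varrho_0}$ of part (i), while for $\rho < 1$ it matches the stretched exponential in (4.123), which is far smaller for large $t$.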


A significant conclusion is that when $\rho < 1$, so that eq. (4.123) holds, then

$$Z_t = \frac{1}{a_t} (\Theta_t - \theta^*) + \varepsilon^Z_t \tag{4.124}$$

with $\varepsilon^Z_t = [\theta^* - \bar\Theta_t]/a_t$ vanishing quickly as $t \to \infty$.


Prop. 4.27 shows how the nonlinear ODE is naturally “linearized”, provided it is convergent. 


Proposition 4.27. Suppose that (QSA1)-(QSA4) hold, and that solutions to (4.44) converge to $\theta^*$ for each initial condition. Then, the scaled error admits the representation

$$\tfrac{d}{dt} Z_t = [r_t I + a_t A(\bar\Theta_t)] Z_t + a_t \Delta_t + \tilde\Xi_t, \qquad Z_0 = 0 \tag{4.125a}$$

where $r_t = -\tfrac{d}{dt} \log(a_t)$, $\tilde\Xi_t = f(\Theta_t, \xi_t) - \bar f(\Theta_t)$, and bounds on $\Delta_t$ are distinguished in the following cases: Setting $A^* = A(\theta^*)$,

(i) With $a_t = 1/(1+t)$,

$$\tfrac{d}{dt} Z_t = a_t [I + A^*] Z_t + a_t \Delta_t + \tilde\Xi_t \tag{4.125b}$$
where j 
= 1/2 

|Ael] = O(—[@e — 8?) = o(lZel) (4.125c) 

(ii) For any $\rho \in (0, 1)$, using the gain $a_t = 1/(1+t)^\rho$ gives

$$\tfrac{d}{dt} Z_t = a_t A^* Z_t + a_t \Delta_t + \tilde\Xi_t \tag{4.125d}$$

where $\Delta_t = \Delta^0_t + (r_t/a_t) Z_t$, with

$$\|\Delta^0_t\| = O\Bigl(\frac{1}{a_t} \|\Theta_t - \bar\Theta_t\|^2\Bigr) \tag{4.125e}$$

Once again $\|\Delta_t\| = o(\|Z_t\|)$, since $r_t = \rho/(1+t)$ in this case.

Note that $r_t = -\tfrac{d}{dt} \log(a_t)$ is always non-negative under (QSA1).

The challenge in applying Prop. 4.27 is that the "noise" process $\tilde\Xi_t$ appearing in (4.125a) is non-vanishing, and is not scaled by a vanishing term. This is resolved through the change of variables: denote for $t \ge 0$,

$$Y_t := Z_t - \Xi_t(\Theta_t) \tag{4.126}$$

where $\Xi_t(\Theta_t)$ is defined in (4.119). It is shown in Prop. 4.38 that $Y_t$ solves the differential equation

$$\tfrac{d}{dt} Y_t = a_t [A^* Y_t + \Delta^Y_t - \Upsilon_t + A^* \Xi^*_t] + r_t [Y_t + \Xi^*_t]$$

where $\Xi^*_t = \Xi_t(\theta^*)$, and $\|\Delta^Y_t\| = o(1 + \|Y_t\|)$ as $t \to \infty$.

It will be shown that this implies convergence: $\lim_{t \to \infty} Y_t = \bar Y$. This leads easily to the proof of Thm. 4.24, from which Thm. 4.25 then follows; the details are established in the remainder of this section.

4.9.2 ODE solidarity


The proof of Thm. 4.15 requires a precise ODE approximation which is explained in this subsection. For this we recall the "averaged" vector field introduced in (4.45):

$$\bar f(\theta) = \lim_{T \to \infty} \frac{1}{T} \int_0^T f(\theta, \xi_t)\, dt, \qquad \text{for all } \theta \in \mathbb{R}^d. \tag{4.127}$$

The first step in the theory is to find assumptions to ensure that the limit exists. Following this, the solutions to the ODE (4.42) and the QSA ODE (4.44) are compared, and shown to converge to the same limit provided both have bounded solutions.
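Recall that the starting point in an ODE approximation is the temporal transformation (4.73). The time-scaled process is then defined by

$$\widehat\Theta_\tau := \Theta(s^{-1}(\tau)) = \Theta_t \big|_{t = s^{-1}(\tau)} \tag{4.128}$$

The chain rule of differentiation (and using "$d\tau = a_t\, dt$") gives

$$\tfrac{d}{d\tau} \Theta(s^{-1}(\tau)) = f(\Theta(s^{-1}(\tau)), \xi(s^{-1}(\tau))).$$

That is, the time-scaled process solves the ODE,

$$\tfrac{d}{d\tau} \widehat\Theta_\tau = f(\widehat\Theta_\tau, \xi(s^{-1}(\tau))). \tag{4.129}$$

The two processes $\Theta$ and $\widehat\Theta$ differ only in time scale, and hence, proving convergence of one proves that of the other. In this subsection we deal exclusively with $\widehat\Theta$; it is on the 'right' time scale for comparison with $\vartheta$, the solution of (4.42).

Define $\vartheta^\tau_w$, $w \ge \tau$, to be the unique solution to (4.42) 'starting' at $\widehat\Theta_\tau$:

$$\tfrac{d}{dw} \vartheta^\tau_w = \bar f(\vartheta^\tau_w), \quad w \ge \tau, \qquad \vartheta^\tau_\tau = \widehat\Theta_\tau. \tag{4.130}$$

We have the suggestive representations:

$$\begin{aligned} \widehat\Theta_{\tau+v} &= \widehat\Theta_\tau + \int_\tau^{\tau+v} f(\widehat\Theta_w, \xi(s^{-1}(w)))\, dw \\ \vartheta^\tau_{\tau+v} &= \widehat\Theta_\tau + \int_\tau^{\tau+v} \bar f(\vartheta^\tau_w)\, dw, \qquad \tau, v \ge 0. \end{aligned} \tag{4.131}$$

Proposition 4.28. Assume that $\widehat\Theta$ is bounded. Then, for any $T > 0$,

$$\lim_{\tau \to \infty} \sup_{v \in [0, T]} \Bigl\| \int_\tau^{\tau+v} [f(\widehat\Theta_w, \xi(s^{-1}(w))) - \bar f(\widehat\Theta_w)]\, dw \Bigr\| = 0 \tag{4.132a}$$

$$\lim_{\tau \to \infty} \sup_{v \in [0, T]} \|\widehat\Theta_{\tau+v} - \vartheta^\tau_{\tau+v}\| = 0. \tag{4.132b}$$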
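The proposition easily establishes Thm. 4.15:

Proof of Thm. 4.15. Under the assumptions of the theorem, there exists $b < \infty$ such that $\|\vartheta^\tau_\tau\| = \|\widehat\Theta_\tau\| \le b$, for $\tau \ge \tau_0$. By the definition of global asymptotic stability, for every $\varepsilon > 0$, there exists $T_\varepsilon > 0$ such that

$$\|\vartheta^\tau_{\tau+v} - \theta^*\| < \varepsilon \quad \text{for all } v \ge T_\varepsilon, \text{ whenever } \|\vartheta^\tau_\tau\| \le b$$

Prop. 4.28 gives,

$$\limsup_{\tau \to \infty} \|\widehat\Theta_{\tau+T_\varepsilon} - \theta^*\| \le \limsup_{\tau \to \infty} \|\widehat\Theta_{\tau+T_\varepsilon} - \vartheta^\tau_{\tau+T_\varepsilon}\| + \limsup_{\tau \to \infty} \|\vartheta^\tau_{\tau+T_\varepsilon} - \theta^*\| \le \varepsilon.$$

We obtain the desired limit, since $\varepsilon > 0$ is arbitrary. □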


The following pages are devoted to the proof of Prop. 4.28. We begin with a crude bound, 
generalizing Prop. 4.19 to the QSA ODE. The proof of Prop. 4.19 extends to Lemma 4.29 with 
only notational changes. 


Lemma 4.29. Consider the ODE (4.129), subject to the Lipschitz condition in (QSA2). Then, there is a constant $B_f$ depending only on $f$ such that

$$\|\widehat\Theta_\tau - \widehat\Theta_0\| \le (B_f + L_f \|\widehat\Theta_0\|)\, \tau e^{L_f \tau}, \qquad \tau \ge 0 \tag{4.133}$$

□


The next step is to obtain a version of (4.132a) with $\widehat\Theta_w$ frozen. The bound (4.134) in Lemma 4.30 is a strong version of the Law of Large Numbers (LLN) for the time-scaled process $\{\xi(s^{-1}(\tau))\}_{\tau \ge 0}$. Notice the difference with a conventional LLN. Here, the interval of integration is some arbitrary fixed $T$, and the averaging becomes more accurate as the interval is shifted towards infinity.

Lemma 4.30. For any $T > 0$ and $\theta \in \mathbb{R}^d$,

$$\Bigl\| \int_\tau^{\tau+T} [f(\theta, \xi(s^{-1}(w))) - \bar f(\theta)]\, dw \Bigr\| \le b_0(\theta)\, \varepsilon^f_\tau, \qquad \varepsilon^f_\tau := 3 a_t \big|_{t = s^{-1}(\tau)} \tag{4.134}$$

where $b_0(\theta)$ is given in eq. (4.82).


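Proof. With $\tilde f_w(\theta) = f(\theta, \xi_w) - \bar f(\theta)$ for each $w$ and $\theta$, denote $\Xi_t := \int_0^t \tilde f_w(\theta)\, dw$. By the assumed bound (4.82),

$$\|\Xi_t\| \le b_0(\theta), \qquad t \ge 0 \tag{4.135}$$

The following integral is simplified using integration by parts:

$$\int_{t_0}^{t_1} a_t \tilde f_t(\theta)\, dt = a_t \Xi_t \Big|_{t_0}^{t_1} - \int_{t_0}^{t_1} a'_t\, \Xi_t\, dt$$

Taking the norm of each side gives, by the triangle inequality,

$$\Bigl\| \int_{t_0}^{t_1} a_t \tilde f_t(\theta)\, dt \Bigr\| \le a_{t_0} \|\Xi_{t_0}\| + a_{t_1} \|\Xi_{t_1}\| + \int_{t_0}^{t_1} |a'_t|\, \|\Xi_t\|\, dt$$

Applying (4.135) gives

$$\Bigl\| \int_{t_0}^{t_1} a_t \tilde f_t(\theta)\, dt \Bigr\| \le 2 a_{t_0} b_0(\theta) - b_0(\theta) \int_{t_0}^{t_1} a'_t\, dt \le 3 a_{t_0} b_0(\theta)$$

where, in the first inequality, we have used the fact that $a_t$ is non-increasing, so that $|a'_t| = -a'_t$.
Letting $t_0 = s^{-1}(\tau)$, $t_1 = s^{-1}(\tau + T)$, $t = s^{-1}(w)$ (giving $dw = a_t\, dt$), yields by a change of variables in integration:

$$\Bigl\| \int_\tau^{\tau+T} [f(\theta, \xi(s^{-1}(w))) - \bar f(\theta)]\, dw \Bigr\| = \Bigl\| \int_{t_0}^{t_1} a_t \tilde f_t(\theta)\, dt \Bigr\| \le 3 a_{t_0} b_0(\theta) = \varepsilon^f_\tau\, b_0(\theta)$$

□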
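The integration-by-parts bound in the proof can be checked on a simple case. The sketch below assumes the scalar centered signal $\tilde f_t = \cos(t)$ (so that $|\int_0^t \tilde f_w\,dw| = |\sin t| \le 1$ and (4.135) holds with $b_0 = 1$) and the gain $a_t = (1+t)^{-\rho}$ with $\rho = 0.8$. These choices, and the helper `weighted_integral`, are assumptions made only for illustration.

```python
import numpy as np

rho = 0.8

def a(t):
    return (1.0 + t) ** (-rho)        # non-increasing gain

def weighted_integral(t0, t1, n=400_000):
    # Trapezoid rule for int_{t0}^{t1} a_t * cos(t) dt, with f_tilde_t = cos(t)
    t = np.linspace(t0, t1, n + 1)
    y = a(t) * np.cos(t)
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t))

b0 = 1.0   # |int_0^t cos(w) dw| = |sin(t)| <= 1, so (4.135) holds with b0 = 1
checks = [(t0, abs(weighted_integral(t0, t0 + 50.0)), 3.0 * a(t0) * b0)
          for t0 in (0.0, 10.0, 100.0, 1000.0)]
```

Each tuple in `checks` holds the start time, the magnitude of the weighted integral, and the bound $3a_{t_0}b_0$; the bound shrinks as the window is shifted to the right, which is the sense in which averaging becomes more accurate as the interval moves towards infinity.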


Proof of Prop. 4.28. The two parts of the proof establish the two limits in the proposition. Recall that these two limits are subject to the assumption that $\widehat\Theta$ is bounded.

Proof of (4.132a): Denote

$$\mathcal{E}_{\tau+v} := \int_\tau^{\tau+v} [f(\widehat\Theta_w, \xi(s^{-1}(w))) - \bar f(\widehat\Theta_w)]\, dw$$

To establish (4.132a) we must show that this converges to zero as $\tau \to \infty$, uniformly for $v$ in bounded intervals.

Fix $\delta > 0$ and denote $\tau_k = \tau + k\delta$ for $k \ge 0$. As in the theory of Riemann integration, the Lipschitz conditions in (QSA2) imply the following bound:

$$\mathcal{E}_{\tau+v} = \sum_{k=0}^{n_v - 1} \int_{\tau_k}^{\tau_k + \delta} [f(\widehat\Theta_{\tau_k}, \xi(s^{-1}(w))) - \bar f(\widehat\Theta_{\tau_k})]\, dw + \varepsilon^\tau$$

where $n_v$ denotes the integer part of $v/\delta$, and $\|\varepsilon^\tau\| \le b_1 v \delta$ for some constant $b_1 < \infty$. The bound is uniform in $\tau$ under the assumption that $\widehat\Theta$ is bounded.
Lemma 4.30 and the triangle inequality imply the bound

$$\|\mathcal{E}_{\tau+v}\| \le \sum_{k=0}^{n_v - 1} \varepsilon^f_{\tau_k} b_0(\widehat\Theta_{\tau_k}) + b_1 v \delta \le \varepsilon^f_\tau \sum_{k=0}^{n_v - 1} b_0(\widehat\Theta_{\tau_k}) + b_1 v \delta$$

Let $b_\bullet < \infty$ denote a constant satisfying $b_0(\widehat\Theta_\tau) \le b_\bullet$ for all $\tau$. Then,

$$\|\mathcal{E}_{\tau+v}\| \le b_\bullet \frac{v}{\delta} \varepsilon^f_\tau + b_1 v \delta$$

Lemma 4.30 then implies that for any $T > 0$,

$$\limsup_{\tau \to \infty} \sup_{v \in [0, T]} \|\mathcal{E}_{\tau+v}\| \le b_1 T \delta$$

This completes the proof of (4.132a), since $\delta > 0$ was arbitrary.

Proof of (4.132b): The result is very similar to Lemma 1 in Chapter 2 of [65]. It is a refinement of Lemma 4.31, and its proof begins with the representation (4.137) for $\mathcal{E}^\tau_w = \vartheta^\tau_w - \widehat\Theta_w$, $w \ge \tau$.
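The Lipschitz conditions in (QSA2) imply the bound:

$$\|\mathcal{E}^\tau_{\tau+v}\| \le \delta^\tau + L_f \int_\tau^{\tau+v} \|\mathcal{E}^\tau_w\|\, dw$$

where

$$\delta^\tau := \sup_{\tau' \ge \tau} \max_{0 \le v \le T} \Bigl\| \int_{\tau'}^{\tau'+v} [\bar f(\widehat\Theta_w) - f(\widehat\Theta_w, \xi(s^{-1}(w)))]\, dw \Bigr\|$$

Prop. 4.2 then gives $\|\mathcal{E}^\tau_{\tau+v}\| \le e^{L_f T} \delta^\tau$ for all $\tau$, and all $0 \le v \le T$. The error term $\delta^\tau$ vanishes as $\tau \to \infty$ due to (4.132a). □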


Local solidarity  We conclude this subsection with a weak but general bound on the difference between $\widehat\Theta$ and $\vartheta^\tau$. The inequality does not require boundedness of solutions to the ODEs, and is useful for small $T > 0$ to establish a Lyapunov drift condition for $\widehat\Theta$.

Lemma 4.31. For some $b < \infty$ and any $0 < T \le 1$,

$$\|\widehat\Theta_{\tau+T} - \vartheta^\tau_{\tau+T}\| \le e^{L_f} b_0(\widehat\Theta_\tau)\, \varepsilon^f_\tau + b(1 + \|\widehat\Theta_\tau\|) T^2 \tag{4.136}$$

where $b_0(\theta)$ is given in eq. (4.82), and $L_f$ is the Lipschitz constant introduced in Assumption (QSA2).


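Proof. Denote $\mathcal{E}^\tau_w = \vartheta^\tau_w - \widehat\Theta_w$ for $w \ge \tau$. The pair of identities (4.131) give

$$\mathcal{E}^\tau_{\tau+T} = \int_\tau^{\tau+T} [\bar f(\widehat\Theta_w) - f(\widehat\Theta_w, \xi(s^{-1}(w)))]\, dw + \int_\tau^{\tau+T} [\bar f(\vartheta^\tau_w) - \bar f(\widehat\Theta_w)]\, dw, \qquad \tau, T \ge 0. \tag{4.137}$$

The Lipschitz conditions in (QSA2) are used to bound the integrands:

$$\begin{aligned} \|\bar f(\widehat\Theta_w) - \bar f(\widehat\Theta_\tau)\| &\le L_f \|\widehat\Theta_w - \widehat\Theta_\tau\| \\ \|f(\widehat\Theta_w, \xi(s^{-1}(w))) - f(\widehat\Theta_\tau, \xi(s^{-1}(w)))\| &\le L_f \|\widehat\Theta_w - \widehat\Theta_\tau\| \\ \|\bar f(\vartheta^\tau_w) - \bar f(\widehat\Theta_w)\| &\le L_f \|\mathcal{E}^\tau_w\| \end{aligned}$$

Consequently, for any $T > 0$,

$$\begin{aligned} \|\mathcal{E}^\tau_{\tau+T}\| &\le \Bigl\| \int_\tau^{\tau+T} [\bar f(\widehat\Theta_\tau) - f(\widehat\Theta_\tau, \xi(s^{-1}(w)))]\, dw \Bigr\| + 2L_f \int_\tau^{\tau+T} \|\widehat\Theta_w - \widehat\Theta_\tau\|\, dw + L_f \int_\tau^{\tau+T} \|\mathcal{E}^\tau_w\|\, dw \\ &\le \alpha^\tau_T + L_f \int_\tau^{\tau+T} \|\mathcal{E}^\tau_w\|\, dw, \qquad \alpha^\tau_T := b_0(\widehat\Theta_\tau)\, \varepsilon^f_\tau + 2L_f \int_\tau^{\tau+T} \|\widehat\Theta_w - \widehat\Theta_\tau\|\, dw \end{aligned}$$

Hence by Grönwall's Lemma, in the form (4.6c),

$$\|\mathcal{E}^\tau_{\tau+T}\| \le \alpha^\tau_T\, e^{L_f T}$$

Using the same proof to derive (4.108b) via Grönwall's lemma, we have

$$\|\widehat\Theta_{\tau+w} - \widehat\Theta_\tau\| \le (B_f + L_f \|\widehat\Theta_\tau\|)\, w e^{L_f w}$$

Increasing $e^{L_f w}$ to $e^{L_f T}$ for the range of integration gives

$$2 \int_0^T \|\widehat\Theta_{\tau+w} - \widehat\Theta_\tau\|\, dw \le 2(B_f + L_f \|\widehat\Theta_\tau\|) e^{L_f T} \int_0^T w\, dw = (B_f + L_f \|\widehat\Theta_\tau\|)\, T^2 e^{L_f T}$$

Hence

$$\alpha^\tau_T \le b_0(\widehat\Theta_\tau)\, \varepsilon^f_\tau + L_f (B_f + L_f \|\widehat\Theta_\tau\|)\, T^2 e^{L_f T}$$

which completes the proof, since $0 < T \le 1$ by assumption. □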


4.9.3 Criteria for stability 


The first step in establishing convergence of QSA is to show that the solutions are bounded in time. Two approaches can be borrowed from the dynamical systems literature: Lyapunov function techniques, or the ODE at ∞ introduced in Section 4.8.4.


Lyapunov criterion 

(QSV1) There exists a continuous function $V : \mathbb{R}^d \to \mathbb{R}_+$ and constants $c_0 > 0$, $\delta_0 > 0$ such that, for any initial condition $\vartheta_0$ of (4.42), and any $0 \le T \le 1$, the following bounds hold whenever $\|\vartheta_s\| > c_0$,

$$V(\vartheta_{s+T}) - V(\vartheta_s) \le -\delta_0 \int_0^T \|\vartheta_{s+t}\|\, dt$$

The Lyapunov function is Lipschitz continuous: there exists a constant $L_V < \infty$ such that $\|V(\theta') - V(\theta)\| \le L_V \|\theta' - \theta\|$ for all $\theta$, $\theta'$.

Assumption (QSV1) ensures that $V(\vartheta_t)$ is strictly decreasing whenever $\vartheta_t$ escapes a ball of radius $c_0$. If $V$ is differentiable then this assumption implies

$$\tfrac{d}{dt} V(\vartheta_t) \le -\delta_0 \|\vartheta_t\|, \qquad \text{whenever } \|\vartheta_t\| > c_0$$

The integral form is chosen since sometimes it is easier to establish a bound in this form. In particular, the proof of Prop. 4.34 below is based on the construction of a solution to (QSV1).


Verifying (QSV1) for a linear system. Consider the ODE (4.42) in which $\bar f(x) = Ax$ with $A$ a Hurwitz $d \times d$ matrix. There is a quadratic function $V_2(x) = x^\top M x$ with $M \in \mathbb{R}^{d \times d}$ satisfying the Lyapunov equation $MA + A^\top M = -I$, with $M > 0$. Consequently, solutions to (4.42) satisfy

$$\tfrac{d}{dt} V_2(\vartheta_t) = -\|\vartheta_t\|^2$$

Choose $V = \sqrt{V_2}$, so that by the chain rule

$$\tfrac{d}{dt} V(\vartheta_t) = -\frac{1}{2\sqrt{V_2(\vartheta_t)}} \|\vartheta_t\|^2 \le -\frac{1}{2\sqrt{\lambda_{\max}}} \|\vartheta_t\|$$

where $\lambda_{\max}$ is the largest eigenvalue of $M$. This $V$ is a Lipschitz solution to (QSV1), for any $c_0 > 0$. □

We first establish ultimate boundedness under a variant of (QSV1):

Lemma 4.32. The solution to (4.129) is ultimately bounded if, for some $T > 0$, $0 < \delta_1 \le 1$, and $\tau_0, b < \infty$,

$$V(\widehat\Theta_{\tau+T}) - V(\widehat\Theta_\tau) \le -\delta_1 \|\widehat\Theta_\tau\|, \qquad \text{for all } \tau \ge \tau_0, \ \|\widehat\Theta_\tau\| > b.$$


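Proof. For each initial condition $\widehat\Theta_0 = \theta$ and $\tau \ge \tau_0$, denote

$$\bar\tau = \bar\tau(\theta, \tau) = \min(v \ge 0 : \|\widehat\Theta_{\tau+v}\| \le b)$$

where $\tau_0$ and $b$ are defined in the lemma. If $\|\widehat\Theta_\tau\| \le b$ then $\bar\tau = 0$, and if $\|\widehat\Theta_{\tau+v}\| > b$ for all $v \ge 0$, set $\bar\tau = \infty$. For $m \in \mathbb{Z}_+$, define $\bar\tau_m = \min\{\bar\tau, m\}$. Then,

$$-\delta_1 \int_\tau^{\tau+\bar\tau_m} \|\widehat\Theta_w\|\, dw \ge \int_\tau^{\tau+\bar\tau_m} [V(\widehat\Theta_{w+T}) - V(\widehat\Theta_w)]\, dw \ge -\int_\tau^{\tau+T} V(\widehat\Theta_w)\, dw$$

The right hand side is independent of $m$, which establishes the upper bound

$$\bar\tau \le \frac{1}{\delta_1 b} \int_\tau^{\tau+T} V(\widehat\Theta_w)\, dw.$$

Under the Lipschitz assumption on $V$, Prop. 4.19 can be applied to establish that for some finite constant $b_V$,

$$\bar\tau \le b_V (1 + \|\widehat\Theta_\tau\|)$$

Hence $\bar\tau(\theta, \tau)$ is everywhere finite.

Denote $b_1 = \sup\{\|\widehat\Theta_{\tau+v}\| : v \le \bar\tau(\theta, \tau),\ \tau \ge \tau_0,\ \|\widehat\Theta_\tau\| \le b + 1\}$. That is, $b_1$ bounds the maximum norm of any excursion of $\widehat\Theta$ that begins at time $\tau$ if $\widehat\Theta_\tau \in S = \{\theta : \|\theta\| \le b + 1\}$, and ends at the arrival time to the set $S_0 = \{\theta : \|\theta\| \le b\}$, denoted $\tau + \bar\tau(\theta, \tau)$. Since every trajectory enters $S_0 \subset S$ for some time $\tau \ge \tau_0$, it follows that $\|\widehat\Theta_\tau\| \le b_1$ for all $\tau$ sufficiently large. □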


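Proposition 4.33. Under (QSV1) the solution to (4.129) is ultimately bounded: there exists $b < \infty$ such that for any $\widehat\Theta_0 = \theta$, $\limsup_{\tau \to \infty} \|\widehat\Theta_\tau\| \le b$.

Proof. Recall that $V$ is the Lyapunov function and $c_0 > 0$ is the constant introduced in (QSV1). For $0 < T \le 1$, $\|\widehat\Theta_\tau\| > c_0 + 1$,

$$\begin{aligned} V(\widehat\Theta_{\tau+T}) - V(\widehat\Theta_\tau) &= V(\widehat\Theta_{\tau+T}) - V(\vartheta^\tau_{\tau+T}) + V(\vartheta^\tau_{\tau+T}) - V(\vartheta^\tau_\tau) \\ &\le |V(\widehat\Theta_{\tau+T}) - V(\vartheta^\tau_{\tau+T})| + V(\vartheta^\tau_{\tau+T}) - V(\vartheta^\tau_\tau) \\ &\le L_V \|\widehat\Theta_{\tau+T} - \vartheta^\tau_{\tau+T}\| - \tfrac12 \delta_0 T \|\widehat\Theta_\tau\| \\ &\le L_V \{e^{L_f} b_0(\widehat\Theta_\tau)\, \varepsilon^f_\tau + b(1 + \|\widehat\Theta_\tau\|) T^2\} - \tfrac12 \delta_0 T \|\widehat\Theta_\tau\|, \end{aligned}$$

where the second inequality follows from the Lipschitz assumption on $V$ and the last inequality uses Lemma 4.31. Recall that $b_0$ is Lipschitz continuous, and $\varepsilon^f_\tau = o(1)$. It follows that we can choose $T > 0$ sufficiently small and $\tau_0$ sufficiently large so that

$$V(\widehat\Theta_{\tau+T}) - V(\widehat\Theta_\tau) \le -\tfrac14 \delta_0 T \|\widehat\Theta_\tau\|, \qquad \tau \ge \tau_0, \ \|\widehat\Theta_\tau\| > c_0 + 1.$$

Lemma 4.32 completes the proof. □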


ODE@∞  When extending the techniques of Section 4.8.4 we require the vector field at ∞ associated with the average vector field:

$$\bar f_\infty(\theta) := \lim_{r \to \infty} r^{-1} \bar f(r\theta), \qquad \theta \in \mathbb{R}^d \tag{4.138}$$

Proposition 4.34. Suppose that the following hold:
(i) The limit (4.138) exists for all $\theta$ to define a continuous function $\bar f_\infty : \mathbb{R}^d \to \mathbb{R}^d$.
(ii) The origin is globally asymptotically stable for the ODE at ∞:

$$\tfrac{d}{dt} \vartheta^\infty_t = \bar f_\infty(\vartheta^\infty_t), \qquad \vartheta^\infty_0 \in \mathbb{R}^d \tag{4.139}$$

Then, there is a Lipschitz continuous function $V$ that satisfies (QSV1).


The main step in the proof of Prop. 4.34 is to show that the assumptions of the proposition imply that the ODE (4.42) is ultimately bounded. Lemma 4.23 is repeated here in the new notation for convenience:

Lemma 4.35. Suppose that (QSA2) holds, and the limit (4.138) exists for all $\theta$ to define a continuous function $\bar f_\infty : \mathbb{R}^d \to \mathbb{R}^d$. Suppose moreover that the origin is asymptotically stable for (4.139). Then the following hold:

(i) $\bar f_\infty(s\theta) = s \bar f_\infty(\theta)$ for each $\theta \in \mathbb{R}^d$ and $s \ge 0$.

(ii) If $\{\vartheta^\infty_t : t \ge 0\}$ is any solution to the ODE (4.139), and $s > 0$, then $\{y_t = s\vartheta^\infty_t : t \ge 0\}$ is also a solution, starting from $y_0 = s\vartheta^\infty_0 \in \mathbb{R}^d$.

(iii) The origin is globally asymptotically stable for (4.139), and convergence to the origin is exponentially fast: for some $R < \infty$ and $\rho > 0$,

$$\|\vartheta^\infty_t\| \le R e^{-\rho t} \|\vartheta^\infty_0\|, \qquad \vartheta^\infty_0 \in \mathbb{R}^d$$

The following is essentially one step in the proof of Prop. 4.22.

Lemma 4.36. Under the assumptions of Lemma 4.35, for each $T < \infty$ and $\varepsilon \in (0, 1)$, there exists $K_T < \infty$ independent of $\varepsilon$, and $B_T(\varepsilon) < \infty$ such that for all solutions to eqs. (4.42) and (4.139) from common initial condition $\vartheta_0$,

$$\|\vartheta_T - \vartheta^\infty_T\| \le B_T(\varepsilon) + K_T [1 + \|\vartheta_0\|] \varepsilon$$

Proof of Prop. 4.34. Choose $T > 0$ so that $\|\vartheta^\infty_t\| \le \tfrac12 \|\theta\|$ when $t \ge T$, for any solution to eq. (4.139), from any initial condition $\vartheta^\infty_0 = \theta$. We then define


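$$V^\infty(\theta) = \int_0^T \|\vartheta^\infty_t\|\, dt, \qquad \vartheta^\infty_0 = \theta$$

$$V(\theta) = \int_0^T V^\infty(\vartheta_t)\, dt, \qquad \vartheta_0 = \theta$$

The Grönwall Inequality implies that each is a Lipschitz continuous function of $\theta$. Moreover, applying Lemma 4.35 it follows that the first is radially homogeneous: $V^\infty(s\theta) = s V^\infty(\theta)$ for each $\theta$ and $s > 0$, and satisfies the lower bound for some $\delta > 0$:

$$V^\infty(\theta) \ge \delta \|\theta\|$$

Consequently, this is a Lyapunov function for the ODE@∞: for each initial condition $\vartheta^\infty_0 = \theta$,

$$V^\infty(\vartheta^\infty_T) = \int_T^{2T} \|\vartheta^\infty_t\|\, dt \le \tfrac12 V^\infty(\theta) \le V^\infty(\theta) - \tfrac12 \delta \|\theta\|$$

The next step is to show that a similar bound holds with $\vartheta^\infty_T$ replaced by $\vartheta_T$. Let $L_V$ denote the Lipschitz constant for $V^\infty$. The bound above combined with Lemma 4.36 gives

$$V^\infty(\vartheta_T) \le V^\infty(\theta) - \tfrac12 \delta \|\theta\| + L_V \bigl(B_T(\varepsilon) + K_T [1 + \|\theta\|] \varepsilon\bigr)$$

Fix $\varepsilon \in (0, 1)$ so that

$$L_V K_T \varepsilon \le \delta / 4$$

giving

$$V^\infty(\vartheta_T) \le V^\infty(\theta) - \tfrac14 \delta \|\theta\| + K_V$$

with $K_V = L_V (B_T(\varepsilon) + K_T)$.

To complete the proof, write

$$V(\vartheta_s) = \int_s^T V^\infty(\vartheta_t)\, dt + \int_0^s V^\infty(\vartheta_{T+t})\, dt$$

The preceding bound gives

$$V^\infty(\vartheta_{T+t}) \le V^\infty(\vartheta_t) - \tfrac14 \delta \|\vartheta_t\| + K_V$$

so that

$$V(\vartheta_s) \le V(\vartheta_0) - \tfrac14 \delta \int_0^s \|\vartheta_t\|\, dt + s K_V$$

This bound implies (QSV1). □

4.9.4 Deterministic Markovian model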


The goal here is to understand Assumption (QSA5) in a simple setting. For this we suppress the variable $\theta$ in the function $\hat f(\theta, \xi)$ appearing in (4.117), and adopt the new notation:

$$\hat g(\xi_{t_0}) = \int_{t_0}^{t_1} [g(\xi_t) - \bar g]\, dt + \hat g(\xi_{t_1}), \qquad 0 \le t_0 \le t_1 \tag{4.140}$$

We are essentially fixing a single index $i$ and parameter $\theta$, and setting $g(\xi_t) = f_i(\theta, \xi_t)$.

This is a version of Poisson's equation (see Section 4.9.1 for further discussion). The function $\hat g$ is the solution (known as the relative value function in some applications), $g$ is the forcing function, and $\bar g$ its steady-state mean. Lemma 4.37 concerns the special case (4.80) for which $H$ is a linear function of $z \in \mathbb{C}^K$.


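Lemma 4.37. Suppose that $g\colon \mathbb{C}^K \to \mathbb{R}$ admits the Taylor series representation,
\[ g(z) = \sum_{n_1,\dots,n_K} a_{n_1,\dots,n_K}\, z_1^{n_1}\cdots z_K^{n_K}, \qquad z \in \Omega, \tag{4.141} \]


where the sum is over all $K$-length sequences in $\mathbb{Z}_+^K$, and the coefficients $\{a_{n_1,\dots,n_K}\} \subset \mathbb{C}$ are
absolutely summable:


\[ \sum_{n_1,\dots,n_K} |a_{n_1,\dots,n_K}| < \infty \tag{4.142} \]


Then, with $\xi$ defined in (4.80),
(i) The ergodic limit holds:


\[ \bar g = \lim_{T\to\infty} \frac{1}{T}\int_0^T g(\xi_t)\, dt = \int_0^1\!\cdots\!\int_0^1 g\bigl(e^{2\pi j t_1},\dots,e^{2\pi j t_K}\bigr)\, dt_1\cdots dt_K \]


where $\bar g = a_0$ (the coefficient when $n_i = 0$ for each $i$).
(ii) There exists a solution $\hat g\colon \mathbb{C}^K \to \mathbb{R}$ to (4.140). It is of the form (4.141):
\[ \hat g(z) = \sum_{n_1,\dots,n_K} \hat a_{n_1,\dots,n_K}\, z_1^{n_1}\cdots z_K^{n_K} \tag{4.143} \]


in which $|\hat a_{n_1,\dots,n_K}| \le |a_{n_1,\dots,n_K}|/\omega_1$ for each coefficient.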


Proof. Complex exponentials are used to obtain the simple formula:


\[ g(\xi_t) = \sum_{n_1,\dots,n_K} a_{n_1,\dots,n_K} \exp\bigl(\{n_1\omega_1 + \cdots + n_K\omega_K\}\, jt\bigr) \]


The absolute-summability assumption (4.142) justifies Fubini's Theorem:

\[ \int_{t_0}^{t_1} [g(\xi_t) - \bar g]\, dt = \sum_{(n_1,\dots,n_K) \neq 0} a_{n_1,\dots,n_K} \int_{t_0}^{t_1} \exp\bigl(\{n_1\omega_1 + \cdots + n_K\omega_K\}\, jt\bigr)\, dt = \hat g(\xi_{t_0}) - \hat g(\xi_{t_1}) \]


where $\hat g$ is given by (4.143) with $\hat a_0 = 0$ (that is, $n_k = 0$ for each $k$), and for all other coefficients,
$\hat a_{n_1,\dots,n_K} = a_{n_1,\dots,n_K}\{n_1\omega_1 + \cdots + n_K\omega_K\}^{-1} j$. □

The lemma then justifies (QSA5) provided $f_i(\theta, \cdot\,)$ satisfies the Taylor series bound for each
$i$ and $\theta$, along with the derivatives $\frac{\partial}{\partial\theta_j} f_i(\theta, \cdot\,)$, for each $i, j$. While an explicit formula for $\hat f$ is
not required in any algorithm, bounds may be valuable in finer convergence rate analysis of QSA
ODEs. In particular, the approximation of the scaled error $Z_t = \frac{1}{a_t}(\Theta_t - \bar\Theta_t)$ obtained in Thm. 4.24
depends on $\hat f(\theta^*, \xi_t)$.
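The coefficient formula in the proof of Lemma 4.37 can be checked numerically. The following is a minimal sketch under stated assumptions: the probing signal is taken to have components $\exp(j\omega_i t)$, and the frequencies `OMEGA` and coefficients `COEFFS` are hypothetical values chosen only for illustration (the example function is complex valued, which is harmless for this check). It compares a trapezoidal approximation of the left hand side of (4.140), rearranged as an integral, with $\hat g(\xi_{t_0}) - \hat g(\xi_{t_1})$.

```python
import cmath
import math

# Hypothetical data (not from the text): K = 2 frequencies, three coefficients.
OMEGA = (1.0, math.sqrt(2.0))
COEFFS = {(0, 0): 0.5, (1, 0): 1.0, (2, 1): -0.3 + 0.2j}


def xi(t):
    """Probing signal, assumed to have components exp(j * omega_i * t)."""
    return tuple(cmath.exp(1j * w * t) for w in OMEGA)


def evaluate(coeffs, z):
    """Evaluate sum_n a_n * z_1^{n_1} * z_2^{n_2}."""
    return sum(a * z[0] ** n1 * z[1] ** n2 for (n1, n2), a in coeffs.items())


def poisson_coeffs(coeffs):
    """Coefficients of g-hat: zero for n = 0, otherwise a_n * j / (n . omega)."""
    out = {}
    for n, a in coeffs.items():
        freq = sum(ni * wi for ni, wi in zip(n, OMEGA))
        out[n] = 0.0 if not any(n) else a * 1j / freq
    return out


def centered_integral(t0, t1, steps=20000):
    """Trapezoidal approximation of the integral of g(xi_t) - g_bar over [t0, t1]."""
    g_bar = COEFFS[(0, 0)]
    h = (t1 - t0) / steps
    total = 0.0
    for k in range(steps + 1):
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * (evaluate(COEFFS, xi(t0 + k * h)) - g_bar)
    return total * h


HAT = poisson_coeffs(COEFFS)
t0, t1 = 0.3, 7.1
lhs = centered_integral(t0, t1)
rhs = evaluate(HAT, xi(t0)) - evaluate(HAT, xi(t1))
print(abs(lhs - rhs))  # small, limited by the quadrature error
```

The sketch only illustrates the algebra behind (4.143); it does not address the summability conditions needed for an infinite series.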


4.9.5 Convergence rates 


We begin with a proof of Prop. 4.27. Recall from Lemma 4.14 the identity $\bar\Theta_t = \vartheta_\tau$ for $t \ge t_0$ and
$\tau \stackrel{\text{def}}{=} s_t \ge \tau_0$.


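Proof of Prop. 4.27. Taking derivatives of each side of (4.72) gives, by the product rule,

\[
\begin{aligned}
\tfrac{d}{dt} Z_t &= \tfrac{d}{dt}\Bigl(\tfrac{1}{a_t}(\Theta_t - \bar\Theta_t)\Bigr) \\
&= \Bigl(-\tfrac{1}{a_t^2}\tfrac{d}{dt}a_t\Bigr)(\Theta_t - \bar\Theta_t) + f(\Theta_t, \xi_t) - \bar f(\bar\Theta_t) \\
&= r_t Z_t + f(\Theta_t, \xi_t) - \bar f(\bar\Theta_t)
\end{aligned}
\]


where in the final equation we used the chain rule for the derivative of a logarithm (recall that
$r_t = -\frac{d}{dt}\log(a_t)$), and the definition of $Z_t$.
On adding and subtracting $\bar f(\Theta_t)$, we arrive at a suggestive decomposition:


\[ \tfrac{d}{dt} Z_t = r_t Z_t + \underbrace{[\bar f(\Theta_t) - \bar f(\bar\Theta_t)]}_{R_t:\ \text{almost linear}} + \underbrace{[f(\Theta_t, \xi_t) - \bar f(\Theta_t)]}_{\tilde\Xi_t:\ \text{bounded disturbance}} \]


That is, under the assumptions of Prop. 4.27,
\[ R_t = A(\bar\Theta_t)\,[\Theta_t - \bar\Theta_t] + \varepsilon_t \]
where, under the Lipschitz condition on $A = \partial_\theta \bar f$,
\[ \|\varepsilon_t\| = O(\|\Theta_t - \bar\Theta_t\|^2) = o(a_t\|Z_t\|) \]


This completes the proof of (4.125a), with $\Delta_t = \varepsilon_t/a_t$.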
If $a_t = 1/(1+t)$ we obtain $r_t = 1/(1+t) = a_t$. Equation (4.125a) thus implies the approximation
(4.125b), where the definition of $\|\Delta_t\|$ is modified to include the error from replacing $A(\bar\Theta_t)$ with
its limit $A^* = A(\theta^*)$.
Consider next the “larger gain” $a_t = 1/(1+t)^\rho$, with $\rho \in (0,1)$, so that $r_t = \rho/(1+t)$. The
simpler approximation (4.125d) follows, in which $\|\Delta_t\|$ has an additional term: once again, we
replace $A(\bar\Theta_t)$ with its limit $A^*$, and also use the approximation $r_t = O(1/t)$. □


Recall the change of variables: $Y_t \stackrel{\text{def}}{=} Z_t - \Xi^{\mathrm{I}}_t(\Theta_t)$ was introduced as a means to remove the
non-vanishing noise $\tilde\Xi_t$ in (4.125a). Prop. 4.38 establishes a differential equation for $Y$, similar to
the QSA ODE (4.44).

The ratio $r_t/a_t$ is bounded in $t$ for the standard choice $a_t = g/(1+t)^\rho$ (recall the definition of $r_t$
in Prop. 4.27).

Proposition 4.38. Under the assumptions of Thm. 4.24, suppose that $r_t/a_t \le b$ for a constant
$b$, and all $t \ge 0$. Then, the vector-valued process $Y$ satisfies the differential equation,


\[ \tfrac{d}{dt} Y_t = a_t\bigl[A^* Y_t + \Delta^Y_t - \Upsilon_t + A^*\Xi^{\mathrm{I}}_t\bigr] + r_t\bigl[Y_t + \Xi^{\mathrm{I}}_t\bigr] \tag{4.144} \]
where $\Xi^{\mathrm{I}}_t = \Xi^{\mathrm{I}}_t(\theta^*)$, and $\|\Delta^Y_t\| = o(1 + \|Y_t\|)$ as $t \to \infty$. That is, for scalars $\{\varepsilon^Y_t\}$,
\[ \|\Delta^Y_t\| \le \varepsilon^Y_t\{1 + \|Y_t\|\}, \qquad t \ge t_0 \]


with $\varepsilon^Y_t \to 0$ as $t \to \infty$.


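Proof. Using the chain rule, we have


\[
\begin{aligned}
\tfrac{d}{dt}\{\Xi^{\mathrm{I}}_t(\Theta_t)\} &= \{f(\Theta_t, \xi_t) - \bar f(\Theta_t)\} + \partial_\theta\Xi^{\mathrm{I}}_t(\Theta_t)\cdot\{\tfrac{d}{dt}\Theta_t\} \\
&= \tilde\Xi_t + \partial_\theta\Xi^{\mathrm{I}}_t(\Theta_t)\cdot\{a_t f(\Theta_t, \xi_t)\}
\end{aligned}
\]


where the second equation follows from the definition $\tilde\Xi_t \stackrel{\text{def}}{=} f(\Theta_t, \xi_t) - \bar f(\Theta_t)$, and the dynamics
(4.44). Rearranging terms we obtain


\[ \tilde\Xi_t = \tfrac{d}{dt}\{\Xi^{\mathrm{I}}_t(\Theta_t)\} - a_t\Upsilon_t(\Theta_t) \tag{4.145} \]


where $\Upsilon_t(\Theta_t) = \partial_\theta\Xi^{\mathrm{I}}_t(\Theta_t)\cdot f(\Theta_t, \xi_t)$.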
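The following is then obtained on substitution into (4.125a):


\[ \tfrac{d}{dt} Y_t = a_t\bigl[A(\bar\Theta_t)Y_t + \Delta_t - \Upsilon_t(\Theta_t) + A(\bar\Theta_t)\Xi^{\mathrm{I}}_t(\Theta_t)\bigr] + r_t\bigl[Y_t + \Xi^{\mathrm{I}}_t(\Theta_t)\bigr] \]
To go from this ODE to (4.144) we must bound the error:
\[ \Delta^Y_t = \Delta^a_t + \Delta^b_t \]
where
\[ \Delta^a_t = \Delta_t + [A(\bar\Theta_t) - A^*]\,(Y_t + \Xi^{\mathrm{I}}_t) \]
\[ \Delta^b_t = A(\bar\Theta_t)\bigl(\Xi^{\mathrm{I}}_t(\Theta_t) - \Xi^{\mathrm{I}}_t\bigr) - \bigl(\Upsilon_t(\Theta_t) - \Upsilon_t\bigr) + \tfrac{r_t}{a_t}\bigl(\Xi^{\mathrm{I}}_t(\Theta_t) - \Xi^{\mathrm{I}}_t\bigr) \]
We have $\|\Delta_t\| = o(1 + \|Y_t\|)$ because of the prior assertion that $\|\Delta_t\| = o(\|Z_t\|)$ as $t \to \infty$, and
the assumption that $\Xi^{\mathrm{I}}_t(\Theta_t)$ is bounded in $t$ (recall eqns. (4.117) and (4.119)). Prop. 4.26 combined
with Lipschitz continuity of $A$ then implies that $\|\Delta^a_t\| = o(1 + \|Y_t\|)$.


To complete the proof, we must bound the error in replacing $\Theta_t$ with $\theta^*$ in each appearance in
$\Delta^b_t$. The representation (4.119) combined with (QSA5) implies that for a constant $L$,


\[ \|A(\bar\Theta_t) - A^*\| \le L\|\bar\Theta_t - \theta^*\| \quad\text{and}\quad \|\Xi^{\mathrm{I}}_t(\Theta_t) - \Xi^{\mathrm{I}}_t\| \le L\|\Theta_t - \theta^*\|, \qquad t \ge 0, \]


and hence both error terms are vanishing, and also $\|\Upsilon_t(\Theta_t) - \Upsilon_t\| = o(1)$ by Lipschitz continuity
of $\partial_\theta\Xi^{\mathrm{I}}_t(\theta)$: from (4.119) and (QSA5):


\[ \partial_\theta\Xi^{\mathrm{I}}_t(\theta) = \hat A(\theta, \xi_0) - \hat A(\theta, \xi_t) \]


These bounds show that $\|\Delta^b_t\| = o(1)$. □

Proof of Thm. 4.24. First rewrite (4.144) as


\[ \tfrac{d}{dt} Y_t = a_t\Bigl[\bigl(A^* + \tfrac{r_t}{a_t}I\bigr)Y_t + \Delta^Y_t - \Upsilon_t + \bigl(A^* + \tfrac{r_t}{a_t}I\bigr)\Xi^{\mathrm{I}}_t\Bigr] \]


where $r_t/a_t = o(1)$ for $\rho < 1$, and $r_t/a_t = 1$ if $\rho = 1$. The above can be regarded as a linear QSA
ODE with vanishing disturbance. Let $\kappa(\rho) = 1\{\rho = 1\}$ (equal to zero for $\rho < 1$, and $\kappa(1) = 1$).
Under the condition that $A^* + \kappa(\rho)I$ is Hurwitz, the proof of Thm. 4.15 can be used with no
significant changes to establish convergence:


\[ \bar Y = \lim_{t\to\infty} Y_t = \lim_{T\to\infty}\frac{1}{T}\int_0^T \bigl[\Xi^{\mathrm{I}}_t + [A^* + \kappa(\rho)I]^{-1}\partial_\theta\Xi^{\mathrm{I}}_t\cdot f(\theta^*, \xi_t)\bigr]\, dt = [A^* + \kappa(\rho)I]^{-1}\bar\Upsilon \]


where the second equality holds because $\lim_{T\to\infty}\frac{1}{T}\int_0^T \Xi^{\mathrm{I}}_t\, dt = 0$ under (4.119), and from the definition
of $\bar\Upsilon$ in (QSA5). This gives the coupling result $Z_t = \bar Y + \Xi^{\mathrm{I}}_t + o(1)$.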
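The second approximation in (4.120a) follows from the first: applying the definition (4.72) gives


\[ \Theta_t = \bar\Theta_t + a_t[\bar Y + \Xi^{\mathrm{I}}_t] + o(a_t) \]


For $\rho < 1$ we have $\bar\Theta_t = \theta^* + o(a_t)$ since $\bar\Theta_t$ converges to $\theta^*$ faster than $t^{-N}$ for any $N$.
For $\rho = 1$, Prop. 4.26 (i) implies that $\bar\Theta_t = \theta^* + O(t^{-\varrho_0})$ where $\mathrm{Real}(\lambda) < -\varrho_0$ for every
eigenvalue $\lambda$ for $A^*$. Therefore, $\bar\Theta_t = \theta^* + o(t^{-1})$ if $I + A^*$ is Hurwitz. □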


The proof of Thm. 4.25 is broken into three lemmas that follow. The assumptions of the
theorem are assumed throughout. In particular, in the definition (4.121c) of $\Theta^{\mathrm{PR}}_T$ it is assumed that
$a_t = 1/(1+t)^\rho$ with $\rho \in (\tfrac12, 1)$, and $T_0 = (1 - 1/\kappa)T$.

The first step is to approximate the estimation error as


\[ \Theta^{\mathrm{PR}}_T - \theta^* = \frac{1}{T - T_0}\int_{T_0}^T [\Theta_t - \theta^*]\, dt = \frac{1}{T - T_0}\int_{T_0}^T [\Theta_t - \bar\Theta_t]\, dt + o(1/T^p) \tag{4.146} \]
where $p > 1$ is fixed but arbitrary. This bound follows from Prop. 4.26 (ii) (recall (4.124)).

The vector valued process $\Phi_T$ appearing in (4.122) is constructed as part of the proof. It is the
sum


\[ \Phi_T = [A^*]^{-1}\{Y_T - Y_{T_0} + \Psi^a_T + \Psi^b_T\} + o(1) \tag{4.147a} \]
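\[ \text{with}\quad \Psi^a_T = \int_{T_0}^T r_t Z_t\, dt \quad\text{and}\quad \Psi^b_T = \int_{T_0}^T a_t\tilde\Upsilon_t\, dt = \int_{T_0}^T a_t\, d\Upsilon^{\mathrm{I}}_t \tag{4.147b} \]


where, from (4.118),
\[ \tilde\Upsilon_t = \Upsilon_t - \bar\Upsilon \quad\text{and}\quad \Upsilon^{\mathrm{I}}_t = \int_0^t \tilde\Upsilon_r\, dr \]
with $\bar\Upsilon$ the ergodic mean introduced in (QSA5). The first term $\{\Psi^a_T\}$ is bounded because $\{Z_t\}$ is
bounded, and $r_t \le 1/(1+t)$, giving


\[ \|\Psi^a_T\| \le \log(T/T_0)\,\sup_t\|Z_t\| = \log(\kappa/(\kappa - 1))\,\sup_t\|Z_t\| < \infty \]


A proof that $\{\Psi^b_T\}$ is bounded is postponed to Lemma 4.41.

Lemma 4.39. Under the assumptions of Thm. 4.25,


\[ \int_{T_0}^T [\Theta_t - \bar\Theta_t]\, dt = [A^*]^{-1}\Bigl\{Z_T - Z_{T_0} - \int_{T_0}^T \tilde\Xi_t\, dt + \Psi^a_T + o(1)\Bigr\} \tag{4.148} \]


Proof. Combining the definition $a_t Z_t = \Theta_t - \bar\Theta_t$ and
\[ \tfrac{d}{dt} Z_t = a_t A^* Z_t + a_t\Delta_t + \tilde\Xi_t \]


(see (4.125d)), gives by the Fundamental Theorem of Calculus:


\[ Z_T - Z_{T_0} = A^*\int_{T_0}^T [\Theta_t - \bar\Theta_t]\, dt + \int_{T_0}^T [a_t\Delta_t + \tilde\Xi_t]\, dt \tag{4.149} \]


Applying Prop. 4.27, the scaled error term can be expressed
\[ a_t\Delta_t = a_t\Delta^\circ_t + r_t Z_t \]


Thm. 4.24 (i) implies that $r_t\|Z_t\| = O(1/t)$ and $a_t\|\Delta^\circ_t\| = O(\|\Theta_t - \bar\Theta_t\|^2) = O(a_t^2)$, so that (4.148)
follows on rearranging terms in (4.149), and using the definition (4.147b). □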


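Lemma 4.40. Under the assumptions of Thm. 4.25,


\[ \Theta^{\mathrm{PR}}_T - \theta^* = \frac{1}{T - T_0}[A^*]^{-1}\Bigl\{Y_T - Y_{T_0} + \int_{T_0}^T a_t\Upsilon_t\, dt + \Psi^a_T + o(1)\Bigr\} \tag{4.150} \]


Proof. Recall (4.145) and (4.119), which give


\[
\begin{aligned}
\int_{T_0}^T \tilde\Xi_t\, dt &= \int_{T_0}^T \tfrac{d}{dt}\{\Xi^{\mathrm{I}}_t(\Theta_t)\}\, dt - \int_{T_0}^T a_t\Upsilon_t(\Theta_t)\, dt \\
&= \bigl[\Xi^{\mathrm{I}}_T(\Theta_T) - \Xi^{\mathrm{I}}_{T_0}(\Theta_{T_0})\bigr] - \int_{T_0}^T a_t\Upsilon_t\, dt + \int_{T_0}^T a_t\, O(\|\Theta_t - \theta^*\|)\, dt
\end{aligned}
\]


where $\int_{T_0}^T a_t\, O(\|\Theta_t - \theta^*\|)\, dt = o(1)$ for $\rho \in (\tfrac12, 1)$. Recalling the definition $Y_t \stackrel{\text{def}}{=} Z_t - \Xi^{\mathrm{I}}_t(\Theta_t)$ in
(4.126) gives,
\[ Z_T - Z_{T_0} - \int_{T_0}^T \tilde\Xi_t\, dt = Y_T - Y_{T_0} + \int_{T_0}^T a_t\Upsilon_t\, dt + o(1) \tag{4.151} \]


Combining (4.148) and (4.151) completes the proof:


\[
\begin{aligned}
\Theta^{\mathrm{PR}}_T - \theta^* &= \frac{1}{T - T_0}\int_{T_0}^T [\Theta_t - \bar\Theta_t]\, dt + o(1/T) \\
&= \frac{1}{T - T_0}[A^*]^{-1}\Bigl\{Y_T - Y_{T_0} + \int_{T_0}^T a_t\Upsilon_t\, dt + \Psi^a_T + o(1)\Bigr\}
\end{aligned}
\]
□

The integral on the right hand side of (4.150) is the crucial term, which can be expressed as
\[ \int_{T_0}^T a_t\Upsilon_t\, dt = \bar\Upsilon\int_{T_0}^T a_t\, dt + \Psi^b_T \tag{4.152} \]


with $\Psi^b_T$ defined in (4.147b).

Lemma 4.41. Under the assumptions of Thm. 4.25, the process $\{\Psi^b_T\}$ is bounded, and the integral
in the first term in (4.152) admits the approximation


\[ \kappa\int_{T_0}^T a_t\, dt = a_T[T + O(1)]\,c(\rho, \kappa), \qquad\text{with}\quad c(\rho, \kappa) = \frac{\kappa}{1 - \rho}\bigl(1 - (1 - 1/\kappa)^{1-\rho}\bigr) \]


Proof. The bound on the integral of $a_t = 1/(1+t)^\rho$ is a calculus exercise. It remains to bound
$\{\Psi^b_T\}$. Using integration by parts, and $\frac{d}{dt}a_t = -r_t a_t$,


\[
\begin{aligned}
\Psi^b_T = \int_{T_0}^T a_t\, d\Upsilon^{\mathrm{I}}_t &= a_t\Upsilon^{\mathrm{I}}_t\Big|_{t=T_0}^{T} - \int_{T_0}^T \bigl[\tfrac{d}{dt}a_t\bigr]\Upsilon^{\mathrm{I}}_t\, dt \\
&= \bigl[a_T\Upsilon^{\mathrm{I}}_T - a_{T_0}\Upsilon^{\mathrm{I}}_{T_0}\bigr] + \int_{T_0}^T r_t a_t\Upsilon^{\mathrm{I}}_t\, dt
\end{aligned}
\]


This is bounded in $T$ because $\Upsilon^{\mathrm{I}}_t$ is bounded under the assumptions of the theorem. □


Proof of Thm. 4.25. Lemma 4.40 combined with (4.152) and Lemma 4.41 establishes (4.122). □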
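
The "calculus exercise" in Lemma 4.41 is easy to sanity-check numerically. The sketch below assumes the reading $c(\rho,\kappa) = \frac{\kappa}{1-\rho}\bigl(1 - (1-1/\kappa)^{1-\rho}\bigr)$ and $T_0 = (1 - 1/\kappa)T$; the function names and the values of `rho` and `kappa` are hypothetical choices for illustration. It uses the closed form of the integral of $a_t = (1+t)^{-\rho}$ and reports the ratio of $\kappa\int_{T_0}^T a_t\,dt$ to $a_T\,T\,c(\rho,\kappa)$.

```python
def gain(t, rho):
    """Step-size a_t = 1 / (1 + t)^rho."""
    return (1.0 + t) ** (-rho)


def c_const(rho, kappa):
    """Constant c(rho, kappa), as read from Lemma 4.41 (an assumption of this sketch)."""
    return kappa / (1.0 - rho) * (1.0 - (1.0 - 1.0 / kappa) ** (1.0 - rho))


def scaled_gain_integral(T, rho, kappa):
    """kappa times the integral of a_t over [T0, T], with T0 = (1 - 1/kappa) T."""
    T0 = (1.0 - 1.0 / kappa) * T
    antiderivative = lambda t: (1.0 + t) ** (1.0 - rho) / (1.0 - rho)
    return kappa * (antiderivative(T) - antiderivative(T0))


rho, kappa = 0.7, 5.0  # hypothetical values with rho in (1/2, 1)
ratios = []
for T in (1e2, 1e4, 1e6):
    ratio = scaled_gain_integral(T, rho, kappa) / (gain(T, rho) * T * c_const(rho, kappa))
    ratios.append(ratio)
    print(T, ratio)
```

The ratio approaches one as $T$ grows, consistent with the $a_T[T + O(1)]$ form of the approximation.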


4.10 Exercises 


4.1 Consider the scalar ODE $\frac{d}{dt}\vartheta = f(\vartheta) = -\vartheta^3$ (previously explored in Exercise 2.15).
(a) Verify that it is globally asymptotically stable.


(b) Simulate using the standard Euler approximation:
\[ \theta_{n+1} = \theta_n + \alpha_{n+1} f(\theta_n) \]
Verify analytically or through simulation that the discrete-time recursion is unstable for any choice
of fixed step-size (that is, $\alpha_n = \alpha_0$ for each $n$, and also $\alpha_0$ independent of $\theta_0$).


(c) Propose a step-size rule that is successful. For what values of $t_n$ is $\theta_n \approx \vartheta_{t_n}$?
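
A starting point for part (b) might look as follows. This is a minimal sketch, not a full solution: the step-size `alpha`, the two initial conditions, and the `blowup` threshold are hypothetical choices. It runs the Euler recursion with a fixed step-size from a small and a large initial condition.

```python
def f(theta):
    """Vector field of the scalar ODE in Exercise 4.1."""
    return -theta ** 3


def euler(theta0, alpha, n_steps=200, blowup=1e6):
    """Fixed step-size Euler recursion; returns inf once |theta| exceeds `blowup`."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta + alpha * f(theta)
        if abs(theta) > blowup:
            return float("inf")
    return theta


alpha = 0.1
small_start = euler(1.0, alpha)    # remains bounded and decays slowly
large_start = euler(10.0, alpha)   # same step-size, larger initial condition
print(small_start, large_start)
```

The recursion reads $\theta_{n+1} = \theta_n(1 - \alpha\theta_n^2)$, so that $|\theta_{n+1}| > |\theta_n|$ whenever $\alpha\theta_n^2 > 2$; this is one way to see why no fixed step-size works for every initial condition.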


4.2 Compute the Newton-Raphson vector field $f^{\mathrm{NR}}$ defined in (4.14b) for the three scalar examples:
$f(x) =$
(a) $-\nabla\Gamma(x)$ with $\Gamma(x) = x^2(1 + (x + 10)^2)$
(b) $-\nabla\Gamma(x)$ with $\Gamma(x) = \log(e^{x} + e^{-x})$
(c) $\sin(x)$
In each case,

• Obtain overlapping plots of $f(\theta)$ and $f^{\mathrm{NR}}(\theta)$ as a function of $\theta$.

Which of the six functions is globally Lipschitz continuous?
• Obtain the roots of $f$ and $f^{\mathrm{NR}}$.


• Identify the regions of attraction: we say that $\theta$ is in the region of attraction of an equilibrium
$\theta^\circ$ for the Newton-Raphson flow if
\[ \lim_{t\to\infty} \vartheta_t = \theta^\circ \]


where $\vartheta_t$ is the solution to (4.14a) at time $t$, with initial condition $\vartheta_0 = \theta$.
Describe the region of attraction for each root of $f^{\mathrm{NR}}$.

4.3 Consider the root finding problem $f(\theta^*) = 0$ with $f_1(\theta) = \theta_1 - 2\theta_2$ and $f_2(\theta) = \|\theta\|^2 - 5$. You
can compute the two solutions $\{\theta^{*+}, \theta^{*-}\}$ by substituting $\theta_1 = 2\theta_2$ into the quadratic equation
$\theta_1^2 + \theta_2^2 = 5$.

(a) A normalized ODE is promising:


\[ \tfrac{d}{dt}\vartheta(1) = -f_1(\vartheta), \qquad \tfrac{d}{dt}\vartheta(2) = -f_2(\vartheta)/\sqrt{1 + \|\vartheta\|^2} \]


where the scaling of $f_2$ is imposed so that the right hand side is Lipschitz continuous. The minus
signs are used in the hopes of achieving stability.

Verify through analysis or simulation that this approach fails.

(b) Apply the Newton-Raphson flow, and plot the resulting trajectories. You can obtain trajectories
using an ODE solver, or compute them explicitly using $f(\vartheta_t) = e^{-t}f(\vartheta_0)$ and then solving for
$\vartheta_t$. Compute or estimate the region of attraction for each of the two equilibria.

(c) Verify that the regularized Newton-Raphson flow (4.15) satisfies conditions (a) and (b) of
Prop. 4.4. Condition (c) fails: find all solutions to $A^{\top}(\theta)f(\theta) = 0$ and discuss the implications.
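
For part (b), a simulation can be compared against the identity $f(\vartheta_t) = e^{-t}f(\vartheta_0)$. The sketch below assumes the Newton-Raphson flow takes the form $\frac{d}{dt}\vartheta = -[A(\vartheta)]^{-1}f(\vartheta)$ with $A$ the Jacobian of $f$, and uses a plain Euler scheme; the initial condition, horizon and step-size are hypothetical choices.

```python
import math


def f(theta):
    """The function of Exercise 4.3."""
    t1, t2 = theta
    return (t1 - 2.0 * t2, t1 * t1 + t2 * t2 - 5.0)


def newton_raphson_field(theta):
    """Return -A(theta)^{-1} f(theta), where A = [[1, -2], [2 t1, 2 t2]]."""
    t1, t2 = theta
    f1, f2 = f(theta)
    det = 2.0 * t2 + 4.0 * t1  # determinant of A; assumed non-zero along the path
    d1 = -(2.0 * t2 * f1 + 2.0 * f2) / det
    d2 = -(-2.0 * t1 * f1 + f2) / det
    return d1, d2


def euler_flow(theta0, T, h=1e-3):
    """Euler approximation of the flow over [0, T]."""
    theta = tuple(theta0)
    for _ in range(int(round(T / h))):
        d = newton_raphson_field(theta)
        theta = (theta[0] + h * d[0], theta[1] + h * d[1])
    return theta


theta0, T = (3.0, 1.0), 5.0
thetaT = euler_flow(theta0, T)
predicted = tuple(math.exp(-T) * v for v in f(theta0))
print(thetaT, f(thetaT), predicted)
```

From this initial condition the trajectory approaches the root $(2, 1)$; other initial conditions, in particular those near the set where $A(\theta)$ is singular, are left to the exercise.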


4.4 The monkey saddle is the two dimensional surface defined by
\[ h(x, y) = x^3 - 3xy^2 \]


A saddle point is a pair $(x^*, y^*)$ at which the gradient $\nabla h$ vanishes.
(a) Verify that the origin is the unique saddle point.
(b) Derive the Newton-Raphson flow to find the saddle point.


(c) Plot $\nabla h(x_t, y_t)$ as a function of $t$ from various initial conditions to see that it does follow a
line from $\nabla h(x_0, y_0)$ to the origin.


4.5 The monkey saddle in polar coordinates is expressed $h(r, \phi) = r^3\cos(3\phi)$. Repeat Exercise 4.4
for this function of two variables.


4.6 Suppose that the following hold for the function $f\colon \mathbb{R}^d \to \mathbb{R}^d$ (just slightly stronger than
assumptions (b) and (c) of Prop. 4.4): $f$ is continuously differentiable, $\|f\|$ is coercive and $A(\theta)$
is full rank for each $\theta$. Conclude that the function $f$ is onto: for each $z \in \mathbb{R}^d$ there is $\theta^z \in \mathbb{R}^d$ for
which $f(\theta^z) = z$. Suggested approach: consider the Newton-Raphson flow using $f_z(\theta) = f(\theta) - z$,
resulting in $\frac{d}{dt}f_z(\vartheta) = -f_z(\vartheta)$. Be sure to explain how you use the coercive condition.


4.7 We can include a matrix gain in the gradient flow if deemed desirable:
\[ \tfrac{d}{dt}\vartheta = -G\nabla\Gamma(\vartheta) \tag{4.153} \]


Suppose that $G$ is positive definite and the assumptions of Prop. 4.7 hold. Design a new Lyapunov
function so that the conclusions of Prop. 4.7 continue to hold using (4.153). You might consider a
weighted norm $V(\theta) = \tfrac12\|\theta\|_M^2 = \tfrac12\theta^{\top}M\theta$, with $M > 0$.

4.8 Let's explore some of the difficulties minimizing the function $\Gamma(x) = x^2(1 + (x + 10)^2)$ using
gradient descent. One problem is that it is not convex, and also has multiple local minima.
Another is that its gradient has cubic growth, which introduces potential numerical problems, as
in Exercise 4.1.


(a) Code an Euler approximation of gradient descent $\frac{d}{dt}\vartheta = -\nabla\Gamma(\vartheta)$. Perform multiple runs,
with varying initial condition (it will eventually fail when you choose an initial condition too large).

(b) Introduce a weighting function $w\colon \mathbb{R} \to [1, \infty)$, and consider the normalized algorithm:
\[ \tfrac{d}{dt}\vartheta = -w(\vartheta)\nabla\Gamma(\vartheta) \]


Choose a continuous weighting function so that the right hand side is globally Lipschitz continuous,
while ensuring that the origin remains a (locally) asymptotically stable equilibrium (prove stability
using a Lyapunov function).


(c) Test the Euler approximation for the modified ODE with a range of initial conditions.

This example reappears in Exercises 4.13 and 8.2.

4.9 Oja's algorithm. This is a famous ODE technique, designed to estimate the eigenvectors of
an $N \times N$ matrix $W$. Suppose that the matrix is positive definite, so that the eigenvalues of $W$
are non-negative. Fix an integer $N_m < N$, and suppose there is a “spectral gap” in the following
sense: if the eigenvalues of $W$ are ordered so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$, then $\lambda_{N_m} > \lambda_{N_m+1}$. Our goal is
to identify these first eigenvalues, along with the subspace $S$ spanned by the first $N_m$ eigenvectors.


Let $m_t$ denote an $N \times N_m$ matrix whose columns are intended to approximate elements of $S$. Oja's
subspace algorithm is expressed as the polynomial differential equation:


\[ \tfrac{d}{dt}m_t = [I - m_t m_t^{\top}]\,W m_t, \tag{4.154} \]


where $m_0$ is given as initial condition. It is known that, for “most” initial conditions, the solution
to the ODE is convergent and the limit $m_\infty$ lies in $S$ (see [91] and also [278, 321, 60]).


(a) Review Exercise 4.8, and observe that Oja's algorithm poses a similar challenge since the right
hand side of the ODE (4.154) is not Lipschitz. Propose a modified ODE through the introduction
of a scalar weighting function.


(b) Experiment with this method to compute the first few singular values of a matrix $A$ of your
choosing ($\sigma_i(A) = \sqrt{\lambda_i(A^{\top}A)}$).

4.10 Analysis of Oja's algorithm. Consider the case $N_m = 1$, so that $m_t$ is a column vector. Assume
as above that $W$ is positive definite.


(a) Using the Lyapunov function $V(x) = \tfrac12\|x\|^2$, show that there is $c_0 > 0$ such that
\[ \tfrac{d}{dt}V(m_t) \le -V(m_t), \qquad\text{whenever } V(m_t) \ge c_0 \]


Conclude that the trajectories are ultimately bounded, in the sense that $V(m_t) \le c_0$ for each initial
condition, and all $t$ sufficiently large.


(b) Take a second look at your expression for $\frac{d}{dt}V(m_t)$, and establish that in fact $\|m_t\| \to 1$ as
$t \to \infty$ from each initial condition.


(c) Let $\{v^i\}$ denote an orthonormal basis of eigenvectors of $W$, and write


\[ m_t = \sum_{i=1}^N \alpha_t(i)\,v^i \]
From the foregoing we have $1 = \lim_{t\to\infty}\|m_t\|^2 = \lim_{t\to\infty}\sum_{i=1}^N \alpha_t(i)^2$.

Establish the following ODE for the coefficients:
\[ \tfrac{d}{dt}\alpha_t(i) = [\lambda_i - \bar\lambda_t]\,\alpha_t(i), \qquad \bar\lambda_t = \sum_{k=1}^N \lambda_k\,\alpha_t(k)^2 \]


(d) Verify that the solution to the ODE from (c) has the representation:
\[ \alpha_t(i) = \exp\Bigl(\int_0^t [\lambda_i - \bar\lambda_r]\, dr\Bigr)\,\alpha_0(i) \]


(e) Assume that $\lambda_1$ is the maximal eigenvalue of $W$ that is not repeated, so that $\lambda_i < \lambda_1$ for $i \ge 2$.
Show that $\alpha_t(1) \to 1$ exponentially quickly in this case. For this it is useful to write


\[ \alpha_t(i) = \exp([\lambda_i - \lambda_1]t)\,\exp\Bigl(\int_0^t [\lambda_1 - \bar\lambda_r]\, dr\Bigr)\,\alpha_0(i) \]

4.11 Failure of PJR averaging. Thm. 4.25 tells us that PJR averaging will lead to the optimal $1/T$
convergence rate, provided the vector $\bar\Upsilon$ defined above (4.118) is null. In this exercise you will see
that this assumption cannot be taken for granted. Consider the scalar QSA ODE $\frac{d}{dt}\Theta_t = a_t f(\Theta_t, \xi_t)$
in which

\[ f(\theta, \xi_t) = -(1 + \sin(t))\,\theta + \xi^0_t, \qquad \theta \in \mathbb{R} \]
where $\xi_t = (\xi^0_t, \sin(t))^{\top}$, and the scalar signal $\{\xi^0_t\}$ has zero mean.


(a) Obtain $\bar f$, $\theta^*$, and expressions for the time-varying quantities of interest in (QSA5):
\[ \hat f(\theta, \xi_t), \quad A(\theta, \xi_t), \quad \hat A(\theta, \xi_t) \]


(b) Choose $\xi^0_t$ so that it has mean zero, yet $\bar\Upsilon = 1$.


(c) Verify numerically that PJR averaging fails for this example, but (4.122) does hold. It is
enough to verify through simulation that


\[ a_T^{-1}\{\Theta^{\mathrm{PR}}_T - \theta^*\} \approx c(\rho, \kappa)\,\bar\Upsilon/A^*, \qquad\text{for } T \text{ very large} \]
Take $\rho$ and $\kappa$ of your choosing, respecting the assumptions of Thm. 4.25.


4.12 Exploration in qSGD. This problem concerns qSGD #1: $\frac{d}{dt}\Theta_t = -a_t\frac{1}{\varepsilon}\xi_t\,\Gamma(\Theta_t + \varepsilon\xi_t)$.
You may assume $a_t = (1 + t)^{-\rho}$ (respecting QSA theory).
The domain of the objective function is $\mathbb{R}^2$, and in this problem you will assume it is quadratic:


\[ \Gamma(\theta) = \tfrac12\theta^{\top}M\theta, \qquad \theta \in \mathbb{R}^2, \quad\text{with } M > 0 \]


The origin is the unique minimizer (by definition of the positive definite condition $M > 0$).


In this exercise you would be wise to apply a variant of Lemma 4.37 (see also (4.49)): for any
polynomial function $g\colon \mathbb{R}^2 \to \mathbb{R}$,


\[ \lim_{T\to\infty}\frac{1}{T}\int_0^T g(\cos(t), \sin(\pi t))\, dt = \int_0^1\!\!\int_0^1 g(\cos(2\pi t_1), \sin(2\pi t_2))\, dt_1\, dt_2 \]

(a) Consider a simple probing signal $\xi_t = p_t v$ where $p_t = \cos(t) + \sin(\pi t)$ and $v \in \mathbb{R}^2$ is fixed.
Obtain $\bar f$ in the ODE approximation $\frac{d}{dt}\vartheta_t = \bar f(\vartheta_t)$, and conclude that this approach fails.
(b) Consider now $\xi_t = \cos(t)v^1 + \sin(\pi t)v^2$ with $v^1 = (1, 0)^{\top}$ and $v^2 = (0, 1)^{\top}$. Obtain $\bar f$ in the
ODE approximation $\frac{d}{dt}\vartheta_t = \bar f(\vartheta_t)$ and identify its stationary points.
Is $\Theta_t$ convergent in this case? If so, does the limit approximate the minimizer $\theta^* = 0$?
(c) Continuing with the special case (b), obtain expressions for the time-varying quantities of
interest in (QSA5):

\[ \hat f(\theta, \xi_t), \quad A(\theta, \xi_t), \quad \hat A(\theta, \xi_t) \]
Based on this obtain an expression for $\bar\Upsilon$.


(d) How do you expect your conclusions will extend to an objective function that is not quadratic?
You should be able to find conditions under which $\|\bar\Upsilon\| = O(\varepsilon)$.
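
The ergodic identity quoted in Exercise 4.12 can be checked numerically before it is used. The sketch below is an illustration under stated assumptions: the polynomial `g`, the horizon `T`, and the grid sizes are hypothetical choices, the time average is computed with the trapezoidal rule, and the double integral with a midpoint rule.

```python
import math


def time_average(g, T=2000.0, h=0.01):
    """Trapezoidal approximation of (1/T) * integral of g(cos t, sin(pi t)) over [0, T]."""
    n = int(round(T / h))
    total = 0.0
    for k in range(n + 1):
        t = k * h
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * g(math.cos(t), math.sin(math.pi * t))
    return total * h / T


def torus_average(g, m=64):
    """Midpoint-rule approximation of the double integral over [0, 1]^2."""
    total = 0.0
    for i in range(m):
        c = math.cos(2.0 * math.pi * (i + 0.5) / m)
        for k in range(m):
            s = math.sin(2.0 * math.pi * (k + 0.5) / m)
            total += g(c, s)
    return total / (m * m)


def g(x, y):
    """Hypothetical polynomial test function."""
    return x * x * y * y + x * y


avg_time = time_average(g)
avg_torus = torus_average(g)
print(avg_time, avg_torus)
```

For this choice of `g` the double integral equals $1/4$, and the time average agrees up to an error that shrinks like $1/T$.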


4.13 Use qSGD combined with PJR-averaging to obtain the minimum of $\Gamma(x) = x^2(1 + (x + 10)^2)$.
Test your algorithms for a range of $\varepsilon$, and initial conditions $\Theta_0 < -10$ and also $\Theta_0 > 2$.

Review Exercise 4.8 before proceeding: success will require projection or some other mechanism to
ensure boundedness of your estimates (neither $\Gamma$ nor its derivative is Lipschitz continuous).

(a) Experiment with each of the three qSGD algorithms, provide plots of the estimates as a
function of time, and comment on your initial findings. Decide on your favorite algorithm for the
remainder of the exercise.

(b) Comment on how the rate of convergence is impacted by $\varepsilon$ for $\Theta_0 > 2$, and the likelihood
of becoming trapped with $\Theta_0 < -10$ (here $\varepsilon > 0$ scales the probing signal in each of the qSGD
algorithms).

Obtain a plot of your estimate of $\theta^*$ as a function of $\varepsilon > 0$ for $\Theta_0 > 2$, and comment on observed
bias.

(c) See if you can design a time-varying process $\{\varepsilon_t\}$ that results in a reliable algorithm that is
convergent for any initial condition $\Theta_0$.

Test your final design with the modified function $\Gamma_m(x) = x^2(1 + (x + m)^2)$ for a range of values of
$m$ (say, $m \in \{-4, 4, 8\}$).

4.14 Consider again the MagBall example introduced in Section 2.7.3. This exercise is a followup
to Exercise 2.19. Our goal is to maintain the ball at rest at some pre-assigned distance $r_0$ from the
magnet.

Our approach is to use gradient free optimization: Let $c(x, u) = \tilde y^2 + u^2$, where $\tilde y = x_1 - r_0$. Propose
a family of policies $u = \phi^\theta(x)$, based on your insight from Exercise 2.19, and minimize $E[J_\theta(X)]$
using qSGD (see (4.103) and surrounding discussion).


4.15 Revisit Exercise 3.10, now with the introduction of a cost function $c(x, u) = \|x\|^2 + u^2$.
Based on your insight from Exercise 3.10, propose a family of policies $u = \phi^\theta(x)$, $\theta \in \mathbb{R}^d$ (choose
$d \le 4$). Obtain an approximation of the minimum of $E[J_\theta(X)]$ using qSGD.


4.16 Optimization of the rowing game. Rather than apply LQR, as in Exercise 3.9, optimize 
(K,, K1, Ky) using qSGD. Are your results similar to what was obtained in Exercise 3.9 when N 
is large? 


4.17 This exercise (as well as Exercise 4.16) might clarify the remarks at the close of Section 4.7.2:
when we search for policies that are optimal within a specific class, we are often forced to abandon
dynamic programming and instead use optimization techniques such as qSGD.


The state evolves on $\mathsf{X} = \mathbb{R}^n$, in continuous time, in which the derivative of each state is directly
influenced only by the local input and a single neighbor:


\[ \tfrac{d}{dt}x_t(i) = x_t(i - 1) + u_t(i), \qquad 1 \le i \le n \]
where for notational convenience we set $x_t(0) = 0$. The cost function is quadratic, of the special
form $c(x, u) = \|x\|^2 + r\|u\|^2$ with $r > 0$.
(a) Obtain the $n \times n$ feedback gain $K^*$ for a range of $n$, and see if you discover any special
structure. You might also see what happens for very small or very large $r$.


(b) We next obtain an optimal gain over a restricted class of policies: $u_t(i) = -\theta x_t(i)$ for each $i$
and $t$. Devise a QSA ODE to find $\theta^*$ and for a range of $r$ compare the performance of your solution
to what was obtained in (a). For this you must compute the value function $J$ associated with your
policy, and compare it to $J^*$.


4.18 Consider the linear QSA ODE with multiplicative noise (4.88):
\[ \tfrac{d}{dt}\Theta_t = a_t f(\Theta_t, \xi_t) = a_t[A_0 + \varepsilon\xi_t A_1]\,\Theta_t \]


with $\xi_t = \sin(\omega t + \phi)$.

(a) What is $\theta^*$? What is the apparent noise (4.69) in this case? Conjecture on the rate of
convergence of the QSA ODE (4.44) with $A_0$ Hurwitz and $a_t = 1/(1 + t)$, and see if you can verify
your conjecture in the scalar case $d = 1$.


Theory for the constant gain algorithm $a_t \equiv \alpha$ is a bit trickier:


(b) Consider this scalar example,
\[ \tfrac{d}{dt}\Theta_t = -[\alpha + \varepsilon\xi_t]\,\Theta_t \]


Estimate the range of $(\alpha, \varepsilon)$ for which the ODE is stable. You might see if you can obtain analytical
results: the scalar linear ODE $\frac{d}{dt}\Theta_t = \beta_t\Theta_t$ admits a closed form solution (skim Exercise 4.10 to
see the solution for a different application).


(c) Suppose that there are $n$ linearly independent eigenvectors $\{v^i\}$ for $A_0^{\top}$, and that these are
also eigenvectors for $A_1^{\top}$: for possibly complex numbers $\{\lambda_i, \mu_i\}$,


\[ A_0^{\top}v^i = \lambda_i v^i, \qquad A_1^{\top}v^i = \mu_i v^i \]


Explain how in this special case you can use your results from (b) to obtain conditions for stability
in the constant gain algorithm.


The next part shows that stability is not so simple when this eigenvector assumption fails.


(d) Stabilization by noise. The article [36] contains a characterization of stability for the general
linear algorithm with constant gain. The theory is illustrated with this numerical example:


\[ \tfrac{d}{dt}\Theta_t = [\alpha A_0 + \varepsilon\xi_t A_1]\,\Theta_t \]


; 0 1 6 13 
using Ay = | | A= a5 eA 


The eigenvalues of $A_0$ are each zero, and those of $A_1$ are $\{1, -2\}$, so neither matrix is Hurwitz.

Figure 4.14: Stabilization by noise (from [36]).


For a non-zero initial condition $\Theta_0$, the Lyapunov exponent is defined as the limit
\[ \Lambda(\alpha, \varepsilon) = \lim_{t\to\infty} t^{-1}\log(\|\Theta_t\|) \]


Fix $\varepsilon = 1/5$, and obtain a plot of estimates of $\Lambda(\alpha, \varepsilon)$ for $\alpha > 0$. Are your results consistent with
the stability region shown in Fig. 4.14?


4.11 Notes 


These notes consist of many components, reflecting the breadth of the chapter. 


ODE methods for algorithm design The ODE (4.14a) was introduced in the economics liter- 
ature, which led to the comprehensive analysis by Smale [325]. The term Newton-Raphson flow for 
(4.14a) was introduced in the deterministic control literature [320, 370]. The Zap SA algorithm was 
introduced at the same time, and based on the same ODE [112, 90, 110]. Within the optimization 
literature, the term “Newton-Raphson dynamical system” is used: see [5] for history.¹⁴ Much more
on this technique will appear in the second half of the book. See Section 8.4.2 for a variant that 
does not require matrix inversion. 

The paper [335] sparked new appreciation for ODE methods within the optimization community 
(with particular interest in applications to ML). The goal was to understand the dynamics of two 
common optimization algorithms with “acceleration”: 


(i) Polyak's heavy-ball method:
\[ x_{k+1} = x_k + \delta_{k+1}[x_k - x_{k-1}] - \alpha_{k+1}\nabla\Gamma(x_k) \tag{4.155} \]
(ii) Nesterov's accelerated gradient algorithm:
\[ x_{k+1} = y_k - \alpha\nabla\Gamma(y_k), \qquad y_k = x_k + \delta_k[x_k - x_{k-1}] \tag{4.156} \]


Either recursion reduces to the gradient descent algorithm (4.28) when $\delta_k = 0$. Polyak's algorithm
takes $\delta_k = \delta > 0$; Nesterov's algorithm uses $\delta_k = (k - 1)/(k + 2)$. Theory behind these algorithms
typically requires $\alpha_k = \alpha$ independent of $k$.


¹⁴ Many thanks to Vivek Borkar for alerting me to Smale's early contributions, and to Francis Bach for passing on [5].

An ODE approximation for Polyak's algorithm (4.155) with $\delta_k = \delta$ is easily anticipated on
denoting $\theta_k = x_k$, and $D_{k+1} = \theta_{k+1} - \theta_k$ [16]. For simplicity take $\alpha = 1$, and write (4.155) as


\[ D_{k+1} - D_k = -(1 - \delta)D_k - \nabla\Gamma(\theta_k) \]


This is an Euler approximation of the second-order ODE,


\[ \tfrac{d^2}{dt^2}\vartheta = -(1 - \delta)\tfrac{d}{dt}\vartheta - \nabla\Gamma(\vartheta) \]


Nesterov's algorithm (4.156) is considered in [335], using the favored time-varying choice of $\delta_k$. A
similar ODE approximation is established:


\[ \tfrac{d^2}{dt^2}\vartheta = -\delta_t\tfrac{d}{dt}\vartheta - \nabla\Gamma(\vartheta), \qquad \delta_t = 3/t \]


The appearance of “3” is explained by the representation $\delta_k = 1 - 3/(k + 2)$.
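
The two recursions (4.155) and (4.156) are short enough to state in code. The following is a minimal sketch under stated assumptions: the quadratic objective behind `grad`, the step-size, and the momentum parameter are hypothetical choices, both methods are started with $x_{-1} = x_0$, and the index convention for $\delta_k$ in `nesterov` is one reasonable reading.

```python
def grad(x):
    """Gradient of the hypothetical objective Gamma(x) = 0.5 * (x1^2 + 10 * x2^2)."""
    return [x[0], 10.0 * x[1]]


def gradient_descent(x0, alpha, n_steps):
    x = list(x0)
    for _ in range(n_steps):
        x = [xi - alpha * gi for xi, gi in zip(x, grad(x))]
    return x


def heavy_ball(x0, alpha, delta, n_steps):
    """Polyak's recursion (4.155) with constant alpha and delta."""
    x_prev, x = list(x0), list(x0)
    for _ in range(n_steps):
        g = grad(x)
        x_next = [xi + delta * (xi - xp) - alpha * gi
                  for xi, xp, gi in zip(x, x_prev, g)]
        x_prev, x = x, x_next
    return x


def nesterov(x0, alpha, n_steps):
    """Nesterov's recursion (4.156) with delta_k = (k - 1) / (k + 2)."""
    x_prev, x = list(x0), list(x0)
    for k in range(1, n_steps + 1):
        delta_k = (k - 1) / (k + 2)
        y = [xi + delta_k * (xi - xp) for xi, xp in zip(x, x_prev)]
        x_prev, x = x, [yi - alpha * gi for yi, gi in zip(y, grad(y))]
    return x


x0, alpha = [1.0, 1.0], 0.05
print(gradient_descent(x0, alpha, 500))
print(heavy_ball(x0, alpha, 0.5, 500))
print(nesterov(x0, alpha, 500))
```

Setting `delta = 0.0` in `heavy_ball` reproduces `gradient_descent`, matching the remark that either recursion reduces to gradient descent when $\delta_k = 0$.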

Better understanding of the dynamics of these ODEs led to a much fuller understanding of 
why Polyak and Nesterov were so successful [16, 335, 198]. This work was part of the inspiration 
for growing interest in ODE and stochastic differential equation (SDE) approximations for more 
efficient recursive algorithms [294, 375, 200, 318, 384], for neural network approximation [358, 85], 
and ODE design based on concepts from robust control theory [163, 164, 127]. 


Optimization Luenberger has been my favorite source for teaching optimization [231, 232], but 
the best encyclopedic treatment is probably [74] together with the recent book [19]. 

A version of Thm. 4.9 is found in Polyak [286]. The main assumption (4.27) is a restricted 
form of a bound introduced by mathematician Lojasiewicz, which is why this is called the Polyak- 
Lojasiewicz (PL) condition in the optimization literature. The simple proof of Thm. 4.9 is taken 
from [179] (which contains much more insight and applications). 

The conclusions of Prop. 4.11 can be improved with a modified algorithm and a more carefully 
constructed Lyapunov function [292, 116]. 

It is worth looking over the field of online optimization with application to control [263, 96]. 
The goals are similar to those of qSGD and policy gradient algorithms. 


QSA  Section 4.5 was initially conceived as an early introduction to stochastic approximation for
algorithm design—a topic explored in Chapter 8. Over the course of writing this book, the mission 
evolved to become a stand-alone toolkit for optimization and control. 

Much of Sections 4.5 to 4.7 is adapted from [40, 41, 88], which was inspired by the prior results 
in [246, 319]; [93] contains applications to gradient-free optimization with constraints. The QSA 
concept was first introduced in [212, 213] for applications to finance. 

The theory of two time-scale stochastic approximation enjoys a parallel history with the theory 
of singular perturbations for differential equations, which has played an important role in control 
theory and applications [186]. A simple example is the dynamical system described by the following
set of differential equations:

\[ \tfrac{d}{dt}x_1 = f_1(x_1, x_2) \]


\[ \varepsilon\,\tfrac{d}{dt}x_2 = f_2(x_1, x_2) \]


It is assumed that $0 < \varepsilon \ll 1$, so that the dynamics of $x_2$ are much faster than that of $x_1$. Suppose
that this is a function of time, with $\varepsilon_t \downarrow 0$ as $t \uparrow \infty$. Assume moreover that there is a continuous
function $\phi$ satisfying $f_2(x_1, \phi(x_1)) = 0$ for each $x_1$. Under further conditions, there is a tight
approximation between the ODE above and


\[ \tfrac{d}{dt}x_1 = f_1(x_1, \phi(x_1)) \]


This is the thinking behind Zap QSA (4.91).
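
The singular perturbation approximation can be illustrated with a toy example. The sketch below is not taken from the text: the functions `f1`, `f2`, `phi`, the constant `eps`, and the Euler step-size are hypothetical choices, with $f_2(x_1, \phi(x_1)) = 0$ by construction. It compares the slow variable of the two time-scale system with the solution of the reduced ODE.

```python
def phi(x1):
    """Hypothetical root of f2(x1, .): f2(x1, phi(x1)) = 0."""
    return -x1


def f1(x1, x2):
    return x2


def f2(x1, x2):
    return -(x2 - phi(x1))


def simulate_full(x1, x2, eps, T, h):
    """Euler approximation of dx1/dt = f1, eps * dx2/dt = f2 (requires h << eps)."""
    for _ in range(int(round(T / h))):
        x1, x2 = x1 + h * f1(x1, x2), x2 + (h / eps) * f2(x1, x2)
    return x1, x2


def simulate_reduced(x1, T, h):
    """Euler approximation of the reduced ODE dx1/dt = f1(x1, phi(x1))."""
    for _ in range(int(round(T / h))):
        x1 = x1 + h * f1(x1, phi(x1))
    return x1


eps, T, h = 0.01, 2.0, 1e-4
x1_full, x2_full = simulate_full(1.0, 0.0, eps, T, h)
x1_reduced = simulate_reduced(1.0, T, h)
print(x1_full, x1_reduced, x2_full - phi(x1_full))
```

In this example the gap between the two solutions is of order `eps`, and the fast variable settles near $\phi(x_1)$ after a short transient.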

PJR averaging was introduced independently by their namesakes [306, 287, 288] (note that
Polyak had independent contributions prior to his collaboration with Juditsky). This work has
nothing to do with QSA, but concerns optimizing the covariance $\Sigma_\Theta$ appearing in (6.40) for stochastic
approximation: see the Notes section of Chapter 8 for more background. The application of
averaging techniques for rate optimization in QSA appears to be new.

The function $\hat g$ in Lemma 4.37 (ii) is precisely the solution to Poisson's equation, with forcing
function $\tilde g = g - a_0$, that appears in theory of simulation of Markov processes, average-cost optimal
control, and stochastic approximation [144, 12, 250, 39].

The phrase ODE method is frequently attributed to Ljung [229], though most authors use this to
mean a method of analysis, rather than a technique for algorithm design. Polyak in [136] credits 
Tsypkin [357] for the realization that stochastic approximation is an invaluable ingredient in the 
creation of algorithms for learning. 

The “ODE@∞” (4.113) was introduced in [70] for stability verification in stochastic approximation:
Prop. 4.22 is a very special case of the Borkar-Meyn Theorem [70, 67], which has been
refined considerably in recent years [296, 297]. The use of abstract ODE models to verify stability 
of stochastic recursions also appears in queueing networks [98, 99, 254] and MCMC [134]. We will 
revisit this approach to stability verification in Chapter 8. 

Assumption (QSA5) is analogous to common assumptions in the study of simulation or stochastic
approximation algorithms when $\xi$ is a Markov process [144, 38]. Conditions for a well behaved
solution to Poisson's equation are available, subject to conditions on the Markov process and the
function. In particular, for stochastic differential equations (SDEs), a non-degeneracy condition
known as hypoellipticity is a first step, and then a solution to Poisson's equation exists subject to
a Lyapunov function drift condition [144]. While the process $\xi$ defined by (4.79) is Markovian, it
is purely degenerate in the sense that Poisson's equation (4.140) in differential form is a first order
PDE:

\[ g(z) + \partial\hat g(z)\cdot H(z) = \bar g, \qquad z \in \Omega \]


There is little theory available for well behaved solutions beyond the simple special case considered 
in Section 4.9.4. 


SGD and Extremum seeking control  In gradient-free optimization the goal is to minimize a
loss function $\Gamma(\theta)$ over $\theta \in \mathbb{R}^d$. It is possible to observe the loss function at any desired value, but
no gradient information is available.

The topic has been studied in two seemingly disconnected research communities: techniques
intended to directly approximate gradient descent through perturbation techniques, known as
Simultaneous Perturbation Stochastic Approximation (SPSA), and Extremum-Seeking Control (ESC)
which is formulated in a purely deterministic setting. Algorithm (4.99) is a stylized version of the 
ESC approach. Much of Sections 4.5 and 4.6 is taken from [40, 41, 86, 87]; [52] also develops SPSA 
using a specially designed class of deterministic probing sequences. 
See [202] for history of gradient free optimization in adaptive control, and [348, 228, 11] for the
nearly century-old history of ESC (and [298, 299] for the Russian perspective).
The following text from [348] is striking:


In his 1922 paper, or invention disclosure, Leblanc describes a mechanism to transfer power
from an overhead electrical transmission line to a tram car using an ingenious non-contact
solution. In order to maintain an efficient power transfer in what is essentially a linear,
air-core, transformer/capacitor arrangement with variable inductance, due to the changing
air-gap, he identifies the need to adjust a (tram based) inductance (the input) so as to
maintain a resonant circuit, or maximum power (the output). Leblanc explains a control
mechanism of how to maintain the desirable maximum power transfer using what is essentially
an extremum seeking solution.


Figure 4.15: Schematic taken from Leblanc’s 1922 disclosure [217], which is considered the
birth of extremum seeking control.


This discussion refers to the 1922 disclosure [217], which amounts to an analog implementation of 
gradient free optimization. A schematic from this document is shown in Fig. 4.15. 

Theory for SPSA began with the algorithm of Kiefer-Wolfowitz [182], which requires at each
iteration access to two perturbations per dimension to obtain a stochastic gradient estimate, as in 
qSGD #3. This computational barrier was addressed in the work of Spall which sparked further 
research [327, 328, 329, 51, 52, 50, 54, 148]. Most valuable for applications in RL is the one- 
measurement form of SPSA introduced in [329]: this can be expressed in the form (4.43), in which 


$$f(\theta_n, \Phi_{n+1}) = \Gamma(\theta_n + \varepsilon \Phi_{n+1})\, \Phi_{n+1} \tag{4.157}$$


where $\boldsymbol{\Phi}$ is a zero-mean and i.i.d. vector-valued sequence. The qSGD #1 algorithm (4.96) is a
continuous time analog.

The introduction of [274] suggests that there is an older history of improvements to SPSA in 
the Russian literature: see eqn. (2) of that paper and surrounding discussion. Beyond history, the 
contributions of [274] include rates of convergence results for standard and new SPSA algorithms. 
Information theoretic lower bounds for optimization methods that have access to noisy observations 
of the true function were derived in [170]. This class of algorithms also has some history in the 
bandits literature [1, 79]. 

In all of the SPSA literature surveyed above, a gradient approximation is obtained through the 
introduction of an i.i.d. probing signal. For this reason, the best possible rate is of order $1/\sqrt{n}$,
which is far slower than can be obtained using QSA techniques. 


Policy gradient techniques are traditionally posed in a stochastic setting, in which $\boldsymbol{\xi}$ is i.i.d.
(independent and identically distributed). The most popular approach is the Actor-Critic method, 
in which a value function approximation algorithm such as TD-learning acts as a sub-routine. There 
is an enormous literature, and it is best to refer to [44, 338] for history, as well as the recent work 
[237]. 
Chapter 5


Value Function Approximations 


We now have all the preliminaries necessary to describe reinforcement learning algorithms designed 
for value function approximation. 

The approximation techniques are built around a family of functions denoted $\mathcal{H}$. Standard
examples discussed in Section 5.1 include neural networks and kernels, as well as linear approximation
using a basis (an example of this can be found in Section 4.5.3). In most cases the function class
is finite-dimensional, with dimension denoted $d$. For example, to approximate the Q-function $Q^\star$
defined in (3.7a), the family is denoted $\{Q^\theta : \theta \in \mathbb{R}^d\}$.

Most algorithms are based on optimization: an algorithm designed to compute the optimal
parameter $\theta^*$ will be based on some loss function $\Gamma(\theta)$, with $\theta^* = \arg\min_\theta \Gamma(\theta)$. The algorithm
may be recursive, in which case it generates a sequence of parameter estimates $\{\theta_n\}$, designed so
that $\theta_n \to \theta^*$ as $n \to \infty$. It should come as no surprise that concepts from Chapter 4 will guide
algorithm design.

Reinforcement learning algorithms are typically designed to be model free, in which the inputs
to the algorithm consist of three terms: $\{u(k)\}$ the input sequence to the control system,
the sequence of observed costs $c(x(k), u(k))$, and observed features that depend on the class
of algorithms. For the linear parameterization $Q^\theta(x,u) = \sum_i \theta_i \psi_i(x,u)$, the sequence of
features is the $d$-dimensional sequence $\{\psi(x(k), u(k))\}$.

[Figure 5.1: block diagram of the learning algorithm; inputs are observed features, and costs or rewards.]

Fig. 5.1 is included to emphasize that these are the only
inputs to the algorithm. We don’t require a model, and the state sequence $\{x(k)\}$ may not be
fully observed. For any approximation $Q^\theta$ we define a policy inspired by optimal control theory (in
particular, eq. (3.7c)):

$$\phi^\theta(x) = \arg\min_u Q^\theta(x,u), \qquad x \in X \tag{5.1}$$
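
The policy (5.1) can be sketched in a few lines when the set of candidate inputs is finite. This is only an illustration under stated assumptions: the helper name `greedy_policy`, the feature map `psi`, and the numerical values are hypothetical and are not taken from the text.

```python
import numpy as np

def greedy_policy(theta, psi, x, inputs):
    """Return the input u in `inputs` minimizing Q^theta(x, u) = theta^T psi(x, u)."""
    q_values = [float(theta @ psi(x, u)) for u in inputs]
    return inputs[int(np.argmin(q_values))]

# Hypothetical scalar example with features (x^2, x*u, u^2)
def psi(x, u):
    return np.array([x * x, x * u, u * u])

theta = np.array([1.0, 2.0, 1.0])        # gives Q^theta(x, u) = (x + u)^2
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]     # assumed finite grid of candidate inputs
u_star = greedy_policy(theta, psi, x=1.0, inputs=inputs)
```

With this choice of parameter the minimizer is the input that cancels the state, which is easy to confirm by hand.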


In standard control textbooks there is a two-step process: 1. identify a model, such as the ARMA 
model (2.4), and 2. design a control solution based on this model (perhaps through optimal control 
techniques). Step 2 is often a significant computational challenge. One of the great achievements 
of RL is to sidestep this challenge by estimating the Q-function directly. 

System identification and value function approximation share common challenges and remedies. 
The notion of exploration that is so important in this chapter is entirely analogous to the persistence 
of excitation requirement in system identification [81, 207, 293]. 


What’s a Good Approximation? If you have read Chapter 3 on optimal control, you
surely want to learn how to approximate the Q-function $Q^\star$. However, you are more eager to
obtain an estimate of the optimal policy:


$$\phi^\star(x) = \arg\min_{u \in U(x)} Q^\star(x,u), \qquad x \in X$$


A few things to keep in mind as we evaluate an algorithm: 


(i) Approximation fidelity. We do not need a highly accurate approximation of the Q-
function if our goal is to obtain a policy that is approximately optimal. Rather, the goal
is that the performance of the policy $\phi^\theta$ is approximately optimal. Ideally then, $\Gamma(\theta)$
would be some measure of policy performance. The mean-square Bellman error (5.5) is
a common surrogate.

(ii) Policy evaluation. Suppose it is computationally feasible to compute or approximate
$\Gamma(\theta_n)$ (perhaps based on a model). In this case, we can keep a tally of performance for
selected iterations $\{\Gamma(\theta_{n_k}) : k \ge 1\}$. We then select those policies among $\{\phi^{\theta_{n_k}} : k \ge 1\}$
with the best performance. Most likely we will do further testing, following the guidelines
in Section 2.2, and suggestions found in Section 5.1.5.


5.1 Function Approximation Architectures 


This section might be viewed as the briefest crash course on machine learning. See [57] for a more 
leisurely introduction to the function approximation concepts covered here. 

The goal is to approximate a function $H^*\colon Z \to \mathbb{R}$, where interpretation of $H^*$ and the definition
of the set of points Z depends on context. This is regarded as a learning problem when the estimate 
is based on data gathered in an experiment. For example, H* might be the Q-function defined in 
(3.7a), and Z = X x U. In this case, the data will be obtained from experiments on the control 
system: the input applied to the system, along with functions of the resulting input-state process. 

The techniques described here for function approximation require a few ingredients: 


(i) A function class H. Three examples are described below: a d-dimensional linear function 
class, d-dimensional non-linear function class defined by a neural network, and one infinite- 
dimensional class: the reproducing kernel Hilbert space (RKHS). 


(ii) For each $h \in \mathcal{H}$ we associate a non-negative “loss” denoted $\Gamma(h)$. The loss function is designed
so that $\Gamma(h)$ is small when $h = H^*$; our approximation is lousy if $\Gamma(h)$ is very large. We impose
just one requirement on this loss function: assumed given are samples $\{z_i : 1 \le i \le N\} \subset Z$,
and $\Gamma$ depends only on $h$ evaluated at the samples. Consequently, rather than thinking of the
domain of $\Gamma$ as the abstract collection $\mathcal{H}$, it is a mapping $\Gamma\colon \mathbb{R}^N \to \mathbb{R}$, with


$$\Gamma(h) = \Gamma(h(z_1), \dots, h(z_N)) \tag{5.2}$$


(iii) An algorithm to obtain the minimizer of $\Gamma(h)$ over $h \in \mathcal{H}$. This book is filled with techniques
for constructing algorithms, and techniques to obtain insight on their rate of convergence.
The objective $\Gamma$ in (5.2) is known as the empirical risk, and its minimization over a function class
$\mathcal{H}$ is known as empirical risk minimization (ERM).
We begin with two examples of loss functions, and three examples of the function class $\mathcal{H}$.


5.1.1 Function approximation based on training data 


Curve fitting Suppose that we have noisy observations of a function $H^*\colon \mathbb{R} \to \mathbb{R}$:

$$y_i = H^*(z_i) + d_i$$

where the noise $\{d_i\}$ is not too large, and has nice statistical properties (for example, its average
is close to zero). The sequence $\{(z_i, y_i) : 1 \le i \le N\}$ is called training data. The quadratic loss
function is defined by

$$\Gamma(h) = \sum_{i=1}^{N} [y_i - h(z_i)]^2, \qquad h \in \mathcal{H} \tag{5.3}$$

If $\Gamma(h^*) = 0$ then the function exactly matches the observations: $h^*(z_i) = y_i$ for each $i$. This looks
like good news in the disturbance-free setting ($d_i = 0$), so that $h^*(z_i) = H^*(z_i)$ for each $i$.


[Figure 5.2 panels, left to right. Over-fitting: $y_i = h(z_i)$ for each $i$. A smooth function with low error:
$y_i \approx h(z_i)$ for each $i$. Under-fitting: excessive regularization results in large error.
Each panel plots the data $(z_i, y_i)$ and the function $h(z)$ against $z$.]


Figure 5.2: Three attempts to approximate the data $\{z_i, y_i\}$ with a smooth function.


Fig. 5.2 shows function approximation outcomes from three different algorithms: each algorithm
constructed the function $h$ based on the training samples $\{(z_i, y_i)\}$. The first plot illustrates typical
results when we put too much trust in the data: we achieved $\Gamma(h) = 0$, which should be good
news. However, it is unlikely that the true function exhibits so many peaks and valleys; this
behavior is most likely the product of a bad algorithm. The term over-fitting is used to describe
this undesirable behavior when $\Gamma(h) \approx 0$. A good algorithm produces the smooth approximation
shown in the middle. This is achieved using a regularizer. With too much regularization, you
obtain a poor approximation, as shown on the right.

The preference for the middle plot in Fig. 5.2 is based on a “smoothness prior” for the underlying 
data; a substitute for the probabilistic priors used in Bayesian statistics. 
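
The three regimes in Fig. 5.2 can be reproduced numerically. The sketch below is illustrative only: it assumes a polynomial function class of degree $N-1$ and a quadratic penalty $\delta\|\theta\|^2$ on the coefficients, neither of which is specified in the text, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
z = np.linspace(-1.0, 1.0, N)
y = np.sin(np.pi * z) + 0.1 * rng.standard_normal(N)   # y_i = H*(z_i) + d_i

# Polynomial features of degree N-1: rich enough to interpolate all N samples
Psi = np.vander(z, N, increasing=True)

def fit(delta):
    """Minimize sum_i [y_i - h(z_i)]^2 + delta * ||theta||^2 via an augmented least-squares problem."""
    A = np.vstack([Psi, np.sqrt(delta) * np.eye(N)])
    b = np.concatenate([y, np.zeros(N)])
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta

def loss(theta):
    """Quadratic loss (5.3) on the training data."""
    return float(np.sum((y - Psi @ theta) ** 2))

# delta = 0: interpolation (over-fitting); moderate delta: smoother fit; large delta: under-fitting
losses = {delta: loss(fit(delta)) for delta in (0.0, 1e-3, 1e2)}
```

The training loss is essentially zero without regularization and grows with $\delta$; a small training loss by itself says nothing about how well $h$ matches $H^*$ away from the samples.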


Mean-square Bellman error In Section 5.3 we begin a survey of techniques to estimate the
optimal Q-function $Q^\star$ defined in (3.7a). This is a function approximation problem in which $H^* = Q^\star$
and $Z = X \times U$. Our second glance ahead in Section 3.7 provided a roadmap, inspired by the Bellman
error (3.7d). For any function $Q\colon X \times U \to \mathbb{R}$, and any input-state sequence $(u, x)$, the temporal
difference is defined in (3.47), and recalled here:


$$\mathcal{D}_{k+1}(Q) = -Q(x(k), u(k)) + c(x(k), u(k)) + \underline{Q}(x(k+1)) \tag{5.4}$$

with $\underline{Q}(x) = \min_u Q(x, u)$.

Given a time horizon $\mathcal{N} \ge 1$, and the input-state sequence $\{u(k), x(k) : 0 \le k \le \mathcal{N}\}$, we must
take $N = \mathcal{N} + 1$ and observations $z_i = (x(i-1), u(i-1))$ to match the notation (5.2), and from
this define the loss function


$$\Gamma(h) = \frac{1}{\mathcal{N}} \sum_{i=1}^{\mathcal{N}} [D_i(h(z_i), h(z_{i+1}))]^2 \tag{5.5a}$$

$$D_i(h(z_i), h(z_{i+1})) = -h(x(i-1), u(i-1)) + c(x(i-1), u(i-1)) + \underline{h}(x(i)) \tag{5.5b}$$


with $\underline{h}(x) = \min_u h(x,u)$ for any function $h$. The complex looking term (5.5b) is the temporal
difference, $D_i(h(z_i), h(z_{i+1})) = \mathcal{D}_i(h)$, as defined in (5.4).

Empirical distributions In the RL literature you will find the term experience replay buffer in
reference to training data, and from this the empirical distribution (or empirical pmf) generated
from this data:

$$\varpi^{\mathcal{N}}(x,u,x^+) = \frac{1}{\mathcal{N}} \sum_{k=0}^{\mathcal{N}-1} \mathbb{1}\{x(k) = x,\ u(k) = u,\ x(k+1) = x^+\}, \qquad x, x^+ \in X,\ u \in U \tag{5.6}$$


This is a pmf on $X \times U \times X$ for any sequence $\{x(k),u(k)\}$ and any $\mathcal{N} \ge 1$. Simple accounting leads
to the following alternate expression for (5.5a):


$$\Gamma(h) = \sum_{x,u,x^+} \varpi^{\mathcal{N}}(x,u,x^+) \{-h(x,u) + c(x,u) + \underline{h}(x^+)\}^2 \tag{5.7}$$


where the sum is over all $(x,u,x^+) \in X \times U \times X$ for which $\varpi^{\mathcal{N}}(x,u,x^+) > 0$. This interpretation of
$\Gamma(h)$ as an empirical mean is useful for both intuition and theory (such as the LP approach to RL
that is surveyed over the final sections of this chapter).
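
The “simple accounting” behind (5.7) can be checked numerically. The following is a minimal sketch on a randomly generated finite model; the model, the horizon, and every variable name are assumptions made for illustration only.

```python
import numpy as np
from collections import Counter

# Hypothetical finite deterministic model, used only to generate a trajectory
rng = np.random.default_rng(1)
nX, nU, horizon = 4, 2, 50
cost = rng.uniform(0.0, 1.0, size=(nX, nU))     # c(x, u)
h = rng.uniform(0.0, 5.0, size=(nX, nU))        # an arbitrary function h(x, u)
F = rng.integers(0, nX, size=(nX, nU))          # x(k+1) = F(x(k), u(k))

x = np.zeros(horizon + 1, dtype=int)
u = rng.integers(0, nU, size=horizon + 1)
for k in range(horizon):
    x[k + 1] = F[x[k], u[k]]

h_min = h.min(axis=1)                           # underline-h(x) = min_u h(x, u)

# (5.5): average of squared temporal differences along the trajectory
td = -h[x[:-1], u[:-1]] + cost[x[:-1], u[:-1]] + h_min[x[1:]]
loss_direct = float(np.mean(td ** 2))

# (5.6)-(5.7): the same loss written as a mean under the empirical pmf
counts = Counter(zip(x[:-1].tolist(), u[:-1].tolist(), x[1:].tolist()))
loss_pmf = sum(
    (n / horizon) * (-h[a, b] + cost[a, b] + h_min[a_next]) ** 2
    for (a, b, a_next), n in counts.items()
)
```

Grouping repeated triples $(x,u,x^+)$ is all that separates the two expressions, so `loss_direct` and `loss_pmf` agree up to rounding.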


5.1.2 Linear function approximation 


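This refers to a family of functions, linearly parameterized by $\theta \in \mathbb{R}^d$:

$$h^\theta(z) = \sum_{i=1}^{d} \theta_i \psi_i(z), \qquad z \in Z \tag{5.8}$$

where $\{\psi_i\}$ are the basis functions. It is convenient to stack these together to form a function
$\psi\colon Z \to \mathbb{R}^d$, and then write $h^\theta = \theta^{\top}\psi$. For any smooth loss function, the first-order condition for
optimality is $0 = \nabla_\theta \Gamma(h^\theta)$. For the mean-square Bellman error (5.5a) this becomes


$$0 = \frac{1}{\mathcal{N}} \sum_{k=1}^{\mathcal{N}} D_k(h^\theta(z_k), h^\theta(z_{k+1}))\, \zeta^\theta(k), \qquad \text{where } \zeta^\theta(k) = \nabla_\theta D_k(h^\theta(z_k), h^\theta(z_{k+1})) \tag{5.9}$$


The choice of basis can be informed by some understanding of the control problem. For example,
if $Z = \mathbb{R}^2$ and it is known that $H^*$ is convex, then it may be sufficient to choose $h^\theta$ quadratic, with
$d = 6$:


$$\psi_1(z) = z_1^2,\ \ \psi_2(z) = z_2^2,\ \ \psi_3(z) = z_1 z_2,\ \ \psi_4(z) = z_1,\ \ \psi_5(z) = z_2,\ \text{ and } \psi_6(z) = 1 \quad \text{for all } z \in \mathbb{R}^2.$$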
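
For the simpler quadratic loss (5.3), rather than the Bellman error, the first-order condition $0 = \nabla_\theta\Gamma(h^\theta)$ reduces to the normal equations of least squares. The sketch below illustrates this with a quadratic basis on $\mathbb{R}^2$; the ordering of the basis functions, the target function, and the sample set are all assumptions made for the example.

```python
import numpy as np

def psi(z):
    """Quadratic basis on R^2 with d = 6 (this ordering is an assumption)."""
    z1, z2 = z
    return np.array([z1 * z1, z2 * z2, z1 * z2, z1, z2, 1.0])

def H_star(z):
    """Hypothetical convex target: 2 z1^2 + z2^2 + z1 z2 + 3."""
    return 2.0 * z[0] ** 2 + z[1] ** 2 + z[0] * z[1] + 3.0

rng = np.random.default_rng(2)
Z = rng.uniform(-1.0, 1.0, size=(200, 2))        # samples z_i
y = np.array([H_star(z) for z in Z])             # disturbance-free observations

Psi = np.array([psi(z) for z in Z])              # N x d matrix of features
# Least squares: the solution satisfies Psi^T (y - Psi theta) = 0
theta, *_ = np.linalg.lstsq(Psi, y, rcond=None)
```

Because the target lies in the span of the basis, the recovered parameter matches the coefficients of the target; with a poorly chosen basis the residual would remain large no matter how much data is collected.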
Tabular and binning A common choice in the theory of RL is the tabular setting in which $z$
denotes a typical pair $(x,u)$. It is assumed that $X$ and $U$ are finite, and an ordering is chosen:
$X \times U = \{z^i = (x^i,u^i) : 1 \le i \le d\}$ in which $d$ is the total number of state-input pairs. The tabular
basis is the family of indicator functions


$$\psi_i(x,u) = \mathbb{1}\{(x,u) = (x^i,u^i)\}, \qquad (x,u) \in X \times U,\ 1 \le i \le d \tag{5.10}$$


If $Z = X \times U$ is not finite, then we can apply binning. Recall from Section 3.9.1 how binning was
applied to obtain an approximate model for the Mountain Car example through first “quantizing”
the state space, and then computing the optimal policy for the approximate model using VIA. In
RL we do not approximate the model, but we can perform binning through an extension of the
tabular basis. Given a disjoint decomposition $Z = \bigcup_{i=1}^{d} B_i$, the $i$th basis is the indicator function


$$\psi_i(x,u) = \mathbb{1}\{(x,u) \in B_i\}, \qquad (x,u) \in X \times U,\ 1 \le i \le d \tag{5.11}$$


This is also written $\psi_i = \mathbb{1}_{B_i}$. Exercise 9.3 is designed to show how this can be a good choice in
RL, in the sense that it leads to a consistent algorithm for value function approximation.
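
A binning basis of the form (5.11) is straightforward to code when the bins are rectangles. The following sketch assumes scalar $x$ and $u$ and a rectangular grid; the function name `binning_basis` and the grid values are hypothetical.

```python
import numpy as np

def binning_basis(edges_x, edges_u):
    """Return psi(x, u): the one-hot indicator of the rectangle B_i containing (x, u)."""
    nx, nu = len(edges_x) - 1, len(edges_u) - 1

    def psi(x, u):
        # Locate the bin in each coordinate; clip so boundary points land in an edge bin
        i = int(np.clip(np.searchsorted(edges_x, x, side="right") - 1, 0, nx - 1))
        j = int(np.clip(np.searchsorted(edges_u, u, side="right") - 1, 0, nu - 1))
        out = np.zeros(nx * nu)
        out[i * nu + j] = 1.0
        return out

    return psi

# Assumed grid: 4 bins for the state and 2 bins for the input, so d = 8
psi = binning_basis(np.linspace(-1.2, 0.6, 5), np.linspace(-1.0, 1.0, 3))
features = psi(0.0, 0.5)
```

With this basis $h^\theta(x,u) = \theta_i$ whenever $(x,u) \in B_i$, so the approximation is piecewise constant on the bins.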


Galerkin relaxation The term Galerkin relaxation appears throughout the book as a means to
approximate equality constraints, and sometimes also inequality constraints. As an example of this
technique, consider again the loss function (5.5) associated with the mean-square Bellman error.
An alternative approximation of the DP equation is obtained by constructing a $d_\zeta$-dimensional
sequence $\{\zeta(k)\}$, and searching for a function $h$ that satisfies the constraint:


$$0 = \frac{1}{\mathcal{N}} \sum_{k=1}^{\mathcal{N}} D_k(h(z_k), h(z_{k+1}))\, \zeta_i(k), \qquad 1 \le i \le d_\zeta \tag{5.12}$$


This is called a Galerkin relaxation, and it is certainly a relaxation of our ultimate if unrealistic goal:
to find a function $h$ for which the temporal difference $D_k(h(z_k),h(z_{k+1}))$ is zero for each $k$. In
the context of RL, the vectors $\{\zeta(k)\}$ appear as eligibility vectors in standard algorithms (see
Section 5.4).

For a finite-dimensional function class we take $d_\zeta = d$, so that (5.12) represents $d$ constraints,
which is consistent with the $d$ unknowns, $\{\theta_i : 1 \le i \le d\}$.

Equation (5.12) appears similar to (5.9). However, $\zeta(k) = \zeta^\theta(k)$ is not a valid choice, since the
Galerkin relaxation does not allow $\zeta(k)$ to depend on $\theta$. In practice we might design $\{\zeta(k)\}$ so
that $\zeta(k) \approx \zeta^\theta(k)$ for $\theta$ in a region of interest.

We are not always so fortunate to have intuition regarding the shape of H*, and binning may be 
too complex, which is why there has been so much attention focused on the “black box” function 
approximation architectures discussed next. 


5.1.3 Neural networks


Neural networks can be used to define a parameterized family of approximations $\{h^\theta\}$ that are
highly nonlinear in $\theta$. The purpose of this very brief introduction is to explain how a neural
network can be used for function approximation, and especially for applications to value function
approximation.

[Figure 5.3 labels: Input Layer, Hidden Layers, Output Layer.]


Figure 5.3: Neural network with three hidden layers.


Fig. 5.3 shows an example of a feed-forward neural network with a single input layer, a single
output layer, and three hidden layers (the optional bias terms are not included). For our
purposes, this figure represents a function approximation $h\colon \mathbb{R}^3 \to \mathbb{R}$, so that the input layer is
$z = (z_1, z_2, z_3)^{\top}$.

This is called a feed-forward network because calculation of $y$ as a function of $z$ is performed
sequentially, moving from left to right. For given weight vectors $\{w^i_k\}$ (whose dimensions will be
clear from the definitions), the calculations proceed as follows:

The first step is to calculate values $s^1 \in \mathbb{R}^4$ in hidden layer one, via


$$s^1_k = \sigma(\langle w^1_k, z\rangle), \qquad 1 \le k \le 4$$


where the notation $\langle w^1_k, z\rangle$ represents the usual dot product of two vectors, and $\sigma\colon \mathbb{R} \to \mathbb{R}$ is known
as the activation function. Two standard choices:


Sigmoid: $\sigma(r) = 1/(1+e^{-r})$        ReLU: $\sigma(r) = \max(0,r)$

Calculation of $s^2, s^3 \in \mathbb{R}^4$ is similar:

$$s^2_k = \sigma(\langle w^2_k, s^1\rangle), \qquad s^3_k = \sigma(\langle w^3_k, s^2\rangle), \qquad 1 \le k \le 4$$


The output is then defined by $y = \langle w^4, s^3\rangle$, which is a linear function of the third hidden layer, but
a complex nonlinear function of the input $z$. The weights are identified with the parameter $\theta$: we
may write $y = h^\theta(z)$, with


$$\{\theta_i : 1 \le i \le d\} = \{w^i_k\}, \qquad d = 3 \times 4 + 4 \times 4 + 4 \times 4 + 4 = 48$$
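
The forward pass described above amounts to three matrix-vector products followed by a dot product. Here is a minimal sketch, assuming the weight vectors of each layer are stacked as the rows of a matrix; the function and variable names are hypothetical, and the weights are random placeholders rather than trained values.

```python
import numpy as np

def relu(r):
    return np.maximum(0.0, r)

def forward(z, W1, W2, W3, w4, sigma=relu):
    """Feed-forward pass for the 3-4-4-4-1 architecture, without bias terms."""
    s1 = sigma(W1 @ z)       # hidden layer one: rows of W1 are the vectors w^1_k
    s2 = sigma(W2 @ s1)      # hidden layer two
    s3 = sigma(W3 @ s2)      # hidden layer three
    return float(w4 @ s3)    # output: linear in s3, nonlinear in z

rng = np.random.default_rng(3)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((4, 4))
W3 = rng.standard_normal((4, 4))
w4 = rng.standard_normal(4)

d = W1.size + W2.size + W3.size + w4.size    # total number of parameters
z = np.array([1.0, -0.5, 2.0])
y = forward(z, W1, W2, W3, w4)
```

The count `d` reproduces the parameter dimension computed above. One consequence of omitting bias terms with the ReLU activation is that the network is positively homogeneous: scaling the input by a positive constant scales the output by the same constant.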


5.1.4 Kernels 


Let’s start at the conclusion: when applying kernel methods, our approximation of $H^*$ takes the
form


$$h^\theta(z) = \sum_{i=1}^{N} \theta_i k(z, z_i), \qquad z \in Z, \tag{5.13}$$


where $k$ is the kernel function that we choose from a large library.

You might argue that this is simply the linear function approximation approach described
earlier, with $d = N$ and $\psi_i(z) = k(z, z_i)$ for each $i$ and $z$. Your argument is absolutely correct! To
appreciate the kernel method, you need to see how we arrive at this particular form for $h^\theta$.

We return to the “beginning”, which is the choice of kernel.
Choice of kernel, and requirements Three standard examples are


Gaussian: $k(z, z') = \exp\bigl(-\|z - z'\|^2 / (2\sigma^2)\bigr)$
Laplacian: $k(z, z') = \exp\bigl(-\|z - z'\| / \sigma\bigr)$
Polynomial: $k(z, z') = (r\langle z, z'\rangle + 1)^m$


where $\sigma > 0$, $r > 0$ and $m \ge 1$ are design parameters.

Recall that in some control applications we may know that $H^*$ is convex and non-negative. In
this case, the polynomial kernel is attractive because $h^\theta$ in (5.13) is convex if $m$ is even, and $\{\theta_i\}$
are non-negative.

Each of these three examples has the symmetry property, $k(x, y) = k(y, x)$. This is one of the
several required properties of a kernel. A crucial requirement is that it is positive definite: for every
$n \ge 1$, every collection $\{z_i : 1 \le i \le n\} \subset Z$, and every $\alpha \in \mathbb{R}^n$,


$$\sum_{i,j=1}^{n} \alpha_i \alpha_j k(z_i, z_j) \ge 0 \tag{5.14}$$


with equality if and only if $\alpha = 0$.
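
The symmetry and positivity requirements can be checked numerically for a given set of points by forming the matrix with entries $k(z_i, z_j)$ and inspecting its eigenvalues. The sketch below does this for the Gaussian kernel; the sample points, the value of $\sigma$, and the helper names are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(z, zp, sigma=0.5):
    """Gaussian kernel k(z, z') = exp(-||z - z'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((z - zp) ** 2) / (2.0 * sigma ** 2))

def gram(kernel, points):
    """Matrix with entries k(z_i, z_j)."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(4)
points = rng.uniform(-1.0, 1.0, size=(6, 2))     # six distinct points in R^2
K = gram(gaussian_kernel, points)
eigs = np.linalg.eigvalsh(K)                      # real eigenvalues, ascending order

alpha = rng.standard_normal(6)
quad_form = float(alpha @ K @ alpha)              # left-hand side of (5.14)
```

For distinct points the smallest eigenvalue is strictly positive, which is the matrix form of (5.14). In floating point it can be very small when points nearly coincide or $\sigma$ is large relative to their spacing.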


Function class for approximation Once we have selected a kernel, we arrive at a function
class $\mathcal{H}$: an infinite-dimensional analog of the set of functions $\{h^\theta : \theta \in \mathbb{R}^d\}$ defined in (5.8). There
is no space in this book to give a full definition of $\mathcal{H}$, and the norm $\|\cdot\|_{\mathcal{H}}$ that is a critical part of
the theory.

For our purposes, it is enough to know that $\mathcal{H}$ contains every function of the form (5.13). That
is, for any integer $n$, scalars $\{\alpha_i\}$, and $\{z_i\} \subset Z$, the following function lies in $\mathcal{H}$:


$$h^\alpha(z) = \sum_{i=1}^{n} \alpha_i k(z, z_i), \qquad z \in Z$$


The primitive functions $h^\alpha$ are also dense in $\mathcal{H}$. That is, if $h \in \mathcal{H}$, then for each $\varepsilon > 0$, there is $h^\alpha$ of
this form (for some integer $n$, scalars $\{\alpha_i\}$, $\{z_i\} \subset Z$, all depending on $\varepsilon$), satisfying $\|h - h^\alpha\|_{\mathcal{H}} \le \varepsilon$.

For any two functions $h^\alpha$, $h^\beta$ of this form, an inner product is introduced that is consistent with
the norm:


$$\langle h^\alpha, h^\beta\rangle_{\mathcal{H}} = \sum_{i,j=1}^{n} \alpha_i \beta_j k(z_i, z_j) \tag{5.15a}$$

$$\|h\|_{\mathcal{H}} = \sqrt{\langle h, h\rangle_{\mathcal{H}}} \tag{5.15b}$$


The positivity assumption (5.14) ensures that $\langle h^\alpha, h^\alpha\rangle_{\mathcal{H}}$ is non-negative. The definition of the inner
product and norm can be extended to the larger collection of functions $\mathcal{H}$, and endowed with this
inner product it is known as a reproducing kernel Hilbert space (RKHS).

Details regarding $\mathcal{H}$ are not required in applications because the Representer Theorem tells us
we can restrict to the primitive functions in the function approximation problems of interest to us.
To present this theorem requires one more ingredient.


Pre-publication draft -- March 25, 2022 




Regularized loss function In addition to the loss function $\Gamma$ we require a regularizer of the form
$G(\|h\|_{\mathcal{H}})$, where $G\colon \mathbb{R}_+ \to \mathbb{R}_+$ is non-decreasing. Typical choices are $G(r) = \delta r^2$ or $G(r) = \delta r$,
with $\delta > 0$. Our interest is solving the regularized optimization problem:


$$h^* = \arg\min\{\Gamma(h) + G(\|h\|_{\mathcal{H}}) : h \in \mathcal{H}\} \tag{5.16}$$

The regularizer is introduced to manage the over-fitting problem illustrated in Fig. 5.2.


Theorem 5.1. (Representer Theorem) Suppose that $\{z_i : 1 \le i \le N\}$ are given, along with
a loss function of the form (5.2). Then, any minimizer of the optimization problem (5.16) can be
expressed, for some $\alpha^* \in \mathbb{R}^N$,


$$h^*(\cdot) = \sum_{i=1}^{N} \alpha_i^* k(\cdot, z_i) \tag{5.17}$$

We return to the two examples: 


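Curve fitting Consider the quadratic loss (5.3). If the regularizer is also quadratic, $G(r) = \delta r^2$,
then the Representer Theorem provides an explicit solution to (5.16). We are left to obtain the
optimal parameter:


$$\alpha^* = \arg\min_\alpha \Bigl\{ \sum_{i=1}^{N} [y_i - h^\alpha(z_i)]^2 + \delta \|h^\alpha\|_{\mathcal{H}}^2 \Bigr\}$$


Let $K$ denote the $N \times N$ matrix with entries $K_{ij} = k(z_i, z_j)$. We then have $\|h^\alpha\|_{\mathcal{H}}^2 = \alpha^{\top} K \alpha$, and
$h^\alpha(z_i) = \sum_j \alpha_j k(z_i, z_j) = [K\alpha]_i$. To compute $\alpha^*$ we set the partial derivatives of the loss equal to
zero:


$$0 = \frac{\partial}{\partial \alpha_j} \Bigl\{ \sum_{i=1}^{N} [y_i - [K\alpha]_i]^2 + \delta \alpha^{\top} K \alpha \Bigr\} = -2 \sum_{i=1}^{N} [y_i - [K\alpha]_i] K_{ij} + 2\delta [K\alpha]_j$$


With $y, \alpha^* \in \mathbb{R}^N$ column vectors, this gives

$$\alpha^* = (K^{\top} K + \delta K)^{-1} K^{\top} y \tag{5.18}$$

The transpose in (5.18) is not necessary, since $K = K^{\top}$ by assumption.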
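
Formula (5.18) is easy to test numerically. The sketch below assumes a Gaussian kernel on scalar samples and a synthetic data set (both are illustrative choices, with hypothetical names); it also compares (5.18) with the simpler expression $(K + \delta I)^{-1} y$, which coincides with it whenever $K$ is invertible since $K^{\top}K + \delta K = K(K + \delta I)$.

```python
import numpy as np

def gaussian_gram(z, sigma=0.2):
    """Matrix K with entries k(z_i, z_j) for the Gaussian kernel on scalar samples."""
    diff = z[:, None] - z[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(5)
N, delta = 12, 0.1
z = np.linspace(-1.0, 1.0, N)
y = np.sin(np.pi * z) + 0.1 * rng.standard_normal(N)   # synthetic training data

K = gaussian_gram(z)
alpha = np.linalg.solve(K.T @ K + delta * K, K.T @ y)    # formula (5.18)
alpha_alt = np.linalg.solve(K + delta * np.eye(N), y)    # equivalent when K is invertible

# Gradient of the regularized loss; it should vanish at the minimizer
grad = -2.0 * K.T @ (y - K @ alpha_alt) + 2.0 * delta * (K @ alpha_alt)
h_fit = K @ alpha_alt                                    # fitted values h(z_i) = [K alpha]_i
```

The second form is usually preferable in computation because it avoids forming $K^{\top}K$, which squares the condition number of $K$.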
Mean-square Bellman error We no longer have an explicit solution to (5.16), even with $G$
quadratic, but we know that $h^*$ is of the form (5.17) for some vector $\alpha^*$. Hence finding $h^*$ is a
finite-dimensional optimization problem. Exercises 5.10 and 5.11 provide a roadmap to approximations
of this complex nonlinear optimization by a sequence of quadratic optimization problems.

A more successful approach may proceed using a convex loss function $\Gamma\colon \mathbb{R}^N \to \mathbb{R}_+$, constructed
by applying the representations in Section 3.5; more on this approach may be found in Section 5.5.


5.1.5 Are We Done Yet? 


Let’s think about how to answer the question within the context of minimizing the mean-square
Bellman error using linear function approximation, which results in the root finding problem (5.9).
That is, $\bar f_N(\theta^*_N) = 0$, with


$$\bar f_N(\theta) = \frac{1}{N} \sum_{k=1}^{N} D_k(h^\theta(z_k), h^\theta(z_{k+1}))\, \zeta^\theta(k)$$


While it may take you a long time to compute $\theta^*_N$, you are far from done.


Some experiments you can perform to obtain more confidence that you have a useful solution: 


Is your parameterization redundant? Consider $Q^\theta = \theta^{\top}\psi$, an approximate Q-function. Along
with estimates of the best value of $\theta$, obtain the sample correlation matrix:


$$R^N = \frac{1}{N} \sum_{i=1}^{N} \psi(z_i)\psi(z_i)^{\top} \tag{5.19}$$


Look at the eigenvalues of this positive semidefinite matrix: if there is a non-trivial null space,
then there may be a problem with your basis or your choice of data.
If $R^N v = 0$ for some non-zero vector $v$, then obviously $v^{\top} R^N v = 0$, meaning that


$$0 = v^{\top} R^N v = \frac{1}{N} \sum_{i=1}^{N} \bigl(v^{\top}\psi(z_i)\bigr)^2$$


It follows that $Q^\theta$, with $\theta = v$, is identically zero on the samples observed. And it means your basis
is redundant, in the sense that one $\psi_k$ is a linear combination of the others: if $v_k \ne 0$, then

$$\psi_k(z_i) = -\frac{1}{v_k} \sum_{j \ne k} v_j \psi_j(z_i), \qquad 1 \le i \le N$$

There are two potential explanations: 1) your basis is truly linearly dependent in an algebraic
sense: $v^{\top}\psi(z) = 0$ for every $z \in Z$, or 2) insufficient exploration: the samples $z_i$ evolve in a small
subset of $Z$.
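
The redundancy test can be carried out with a single eigenvalue decomposition. The sketch below uses a deliberately redundant basis so that the null space is visible; the basis, the sample distribution, and the names are hypothetical.

```python
import numpy as np

def psi(z):
    """Hypothetical basis on R^2 with a built-in redundancy: psi_3 = psi_1 + psi_2."""
    return np.array([z[0], z[1], z[0] + z[1], 1.0])

rng = np.random.default_rng(6)
samples = rng.uniform(-1.0, 1.0, size=(500, 2))
Psi = np.array([psi(z) for z in samples])      # row i is psi(z_i)^T

R = Psi.T @ Psi / len(samples)                  # sample correlation matrix (5.19)
eigvals, eigvecs = np.linalg.eigh(R)            # eigenvalues in ascending order
v = eigvecs[:, 0]                               # eigenvector for the smallest eigenvalue
```

Here the smallest eigenvalue is zero up to rounding, and $v^{\top}\psi(z_i)$ vanishes on every sample. In practice a threshold relative to the largest eigenvalue is needed to decide what counts as “zero”, and a small eigenvalue alone does not distinguish a dependent basis from insufficient exploration.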


Is your parameter predictive using fresh data? Obtain $M \ge 1$ more batches of data
$\{z^m : 1 \le m \le M\}$, and compute $\bar f^m_N(\theta^*_N)$ for each $m$, with


$$\bar f^m_N(\theta) = \frac{1}{N} \sum_{k=1}^{N} D_k(h^\theta(z^m_k), h^\theta(z^m_{k+1}))\, \zeta^\theta(k), \qquad 1 \le m \le M$$


You need to increase $N$ if there is large variability in $\{\bar f^m_N(\theta^*_N) : 1 \le m \le M\}$.

Is the output of your algorithm predictive of what really matters? This will take some
work, but it is truly essential. With $M \gg 1$ batches of data, estimate the performance you obtain
with the output of your algorithm. This means that for each $m = 1, \dots, M$ you must


(i) Obtain an estimate $\theta^{*m}$ using your algorithm.


(ii) Obtain $\phi^{*m}(x) = \arg\min_u Q^{\theta^{*m}}(x, u)$.

(iii) Run more experiments to estimate the performance. For the total cost problems considered
here, choose a pmf $\nu$ with finite support. For each initial condition $x^i$ satisfying $\nu(x^i) > 0$
run a simulation to estimate $J_m(x^i)$ under policy $\phi^{*m}$, and then obtain $\Gamma_m = \sum_i \nu(x^i) J_m(x^i)$.
Look at the sample mean and variance of $\{\Gamma_m : 1 \le m \le M\}$. High variance means you need
a longer run.

Or, you might decide to look more closely at those policies for which $\Gamma_m$ is smallest: maybe
you got lucky! To know for sure, you need a deeper investigation of performance, using data
independent of what was used for training.


For many readers, the preceding discussion is mainly a guide to receive a passing grade on upcoming 
simulation assignments! If we are talking about real life, rather than a homework problem, then 
you need advice from experts. For example, if your $\theta^*$ is supposed to define an optimal policy for
an autonomous car, then you need experts in sociology as well as highway engineering to conduct 
realistic experiments to validate your control design. 


5.2 Exploration and ODE Approximations


The success of the RL algorithms surveyed in this chapter depends in part on the choice of input
$u$ used for training. The purpose of this section is to make this precise, and present our main
assumption on the input designed for generating data to train the algorithm (that is, exploration,
as first surveyed in Section 2.5.3). Throughout this chapter it is assumed that the input used for
training is state-feedback with perturbation, of the form


$$u(k) = \check\phi(x(k), \xi(k)) \tag{5.20}$$

where $\boldsymbol{\xi}$ is a bounded sequence evolving on a set $\Omega \subset \mathbb{R}^p$ for some $p \ge 1$. It plays the same role
as the probing signal introduced for gradient-free optimization in Section 4.6, with applications to
policy gradient algorithms in Section 4.7.

In the theoretical development of QSA it was convenient to assume that the exploration itself
evolves according to the autonomous state space model (4.79). In discrete time we make the change
of notation:

$$\xi(k+1) = H(\xi(k)) \tag{5.21}$$

in which $H\colon \Omega \to \Omega$ is continuous. Subject to the policy (5.20), it follows that the triple $\Phi(k) =
(x(k), u(k), \xi(k))^{\top}$ has a similar recursive form, evolving on the larger state space $Z$. In some cases,
such as in TD($\lambda$) learning, it is necessary to add additional components to $\Phi(k)$, and extend the
state space $Z$. This is the reason for the abstract description of $\boldsymbol{\Phi}$ in Assumption (A$\xi$) below.


A few words on ergodic averages Assumption (A$\xi$) that follows is similar to Assumption (QSA5)
appearing in Section 4.9, in that both concern averages of observations. In the
discrete-time setting of this chapter we denote


$$\bar g_N = \frac{1}{N} \sum_{k=1}^{N} g(\Phi(k))$$


for $g\colon Z \to \mathbb{R}$ continuous, and $N \ge 1$. The main assumption is the existence of a limit, known as
the ergodic mean (also known as ergodic average or expectation):


$$\mathsf{E}_\varpi[g(\Phi)] = \lim_{N \to \infty} \bar g_N \tag{5.22}$$

The expression $\mathsf{E}_\varpi[g(\Phi)]$ may be regarded as convenient notation, but in many cases it is represented
as an integral: the probability measure $\varpi$ has a density $\rho$, so that


$$\mathsf{E}_\varpi[g(\Phi)] = \int g(z)\rho(z)\, dz$$


Here is a simple result to illustrate the origin of a density: 


Lemma 5.2. Consider the scalar probing signal $\xi(k) = \sin(2\pi k/T)$, $k \ge 0$. Provided $T$ is an
irrational number, for any continuous function $g\colon \mathbb{R} \to \mathbb{R}$ we have


$$\lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} g(\xi(k)) = \int_0^1 g(\sin(2\pi t))\, dt = \int_{-1}^{1} g(t)\rho(t)\, dt, \tag{5.23}$$


where $\rho(t) = [\pi\sqrt{1 - t^2}]^{-1}$ is known as the arcsine density.


Proof. Consider first the signal $\xi^0(k) = [k/T]_1$, where $[r]_1$ denotes the fractional part of a scalar
$r \in \mathbb{R}_+$. This signal samples points uniformly in the interval $[0,1]$, giving for continuous functions
$h\colon \mathbb{R} \to \mathbb{R}$,


$$\lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} h(\xi^0(k)) = \int_0^1 h(r)\, dr$$


The first equality in (5.23) follows on taking $h(\xi^0(k)) = g(\sin(2\pi\xi^0(k))) = g(\xi(k))$. The second
equality is a calculus exercise. $\square$
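
Lemma 5.2 can be illustrated numerically. The sketch below assumes the test function $g(t) = t^2$, for which all three expressions in (5.23) equal $1/2$, and the period $T = 10\sqrt{2}$; both choices, and the variable names, are illustrative. A floating-point $T$ is of course rational, but its period in $k$ is far longer than the run length.

```python
import numpy as np

T = 10.0 * np.sqrt(2.0)              # an irrational period, up to floating point
N = 200_000
k = np.arange(1, N + 1)
xi = np.sin(2.0 * np.pi * k / T)     # probing signal xi(k)

def g(t):
    return t ** 2

time_average = float(np.mean(g(xi)))             # left-hand side of (5.23)

# Middle expression in (5.23): integral over one period by the midpoint rule
t_mid = (np.arange(10_000) + 0.5) / 10_000
period_average = float(np.mean(g(np.sin(2.0 * np.pi * t_mid))))

# Fraction of samples in each bin, versus the mass assigned by the arcsine density
edges = np.linspace(-0.9, 0.9, 10)
hist, _ = np.histogram(xi, bins=edges)
empirical = hist / N
arcsine_mass = (np.arcsin(edges[1:]) - np.arcsin(edges[:-1])) / np.pi
```

The histogram comparison uses the fact that the arcsine density integrates to $(\arcsin b - \arcsin a)/\pi$ over an interval $[a,b] \subset [-1,1]$.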


It will simplify some analysis to impose uniformity of the limit (5.22) over Lipschitz continuous
functions. For any $L > 0$ denote


$$\mathcal{G}_L = \{g : \|g(z') - g(z)\| \le L\|z - z'\|, \ \text{for all } z, z' \in Z\}$$


(A$\xi$) The state and action spaces $X$ and $U$ are each closed subsets of Euclidean space;
$F$ defined in (3.1), $\check\phi$ defined in (5.20), and $H$ in (5.21) are each continuous on their
domains. There is a larger state process $\boldsymbol{\Phi}$ with the following properties:


(i) $\boldsymbol{\Phi}$ evolves on a closed subset of Euclidean space, denoted $Z$, and $(x(k), u(k), \xi(k)) =
w(\Phi(k))$ for each $k$, where $w\colon Z \to X \times U \times \Omega$ is Lipschitz continuous.
(ii) There is a probability measure $\varpi$ such that for any continuous function $g\colon Z \to \mathbb{R}$,
the ergodic mean (5.22) exists for each initial condition $\Phi(0)$.
(iii) The limit in (5.22) is uniform on $\mathcal{G}_L$, for each $L < \infty$:
$$\lim_{N \to \infty} \sup_{g \in \mathcal{G}_L} |\bar g_N - \mathsf{E}_\varpi[g(\Phi)]| = 0$$

The “quasi-randomized” policy structure defined by eqs. (5.20) and (5.21) is imposed so that
the ergodic limit (5.22) can be expected to exist. Please remember that these assumptions are
not essential for successful implementation of algorithms. They are introduced only to simplify
analysis.
ODE approximations Just as in the previous chapter, ergodicity allows for approximation of
algorithms by simpler ODE approximations. In particular, consider a recursion of the form


$$\theta_{n+1} = \theta_n + \alpha_{n+1} f_{n+1}(\theta_n), \qquad n \ge 0 \tag{5.24}$$


in which $\{f_n\}$ is a sequence of functions that admits an ergodic limit:


$$\bar f(\theta) \mathrel{:=} \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} f_k(\theta), \qquad \theta \in \mathbb{R}^d$$


The associated ODE is defined using this vector field:

$$\tfrac{d}{dt}\vartheta_t = \bar f(\vartheta_t) \tag{5.25}$$


An ODE approximation is defined by mimicking the usual Euler construction: the time-scale
for the ODE is defined by the non-decreasing time points $\tau_0 = 0$ and $\tau_n = \sum_{k=1}^{n} \alpha_k$ for $n \ge 1$.
Define a continuous time process by $\Theta_{\tau_n} = \theta_n$ for each $n$, and extend to all $t$ through piecewise
linear interpolation. Let $\{\vartheta^n_t : t \ge \tau_n\}$ denote the solution to the ODE (5.25) with initial condition
$\vartheta^n_{\tau_n} = \theta_n$. We then say that the algorithm (5.24) admits an ODE approximation if for each initial
$\theta_0$ and $N > 0$,

$$\lim_{n \to \infty} \sup_{\tau_n \le t \le \tau_n + N} \|\Theta_t - \vartheta^n_t\| = 0 \tag{5.26}$$

If the parameter sequence $\{\theta_n\}$ is bounded, then it is often easy to establish (5.26) by following
the steps of Prop. 4.28. We can then follow the proof of Thm. 4.15 to establish convergence of the
parameter sequence whenever (5.25) is globally asymptotically stable.
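
The definition (5.26) can be explored on a toy example. The sketch below is a hypothetical scalar instance, not taken from the text: $f_n(\theta) = -(\theta - 1) + \sin(n)$, whose ergodic mean is $\bar f(\theta) = -(\theta - 1)$, so that the ODE (5.25) has the explicit solution $\vartheta_t = 1 + (\vartheta_0 - 1)e^{-t}$. The gap in (5.26) is evaluated only at the grid points $\tau_k$, which is an approximation of the supremum over $t$.

```python
import numpy as np

def run(rho, N=5000, theta0=100.0):
    """Recursion (5.24) with step-size 1/n^rho and f_n(theta) = -(theta - 1) + sin(n)."""
    n = np.arange(1, N + 1)
    alpha = 1.0 / n ** rho
    tau = np.concatenate([[0.0], np.cumsum(alpha)])     # tau_0 = 0, tau_n = alpha_1 + ... + alpha_n
    theta = np.empty(N + 1)
    theta[0] = theta0
    for i in range(N):
        theta[i + 1] = theta[i] + alpha[i] * (-(theta[i] - 1.0) + np.sin(i + 1))
    return tau, theta

def window_gap(tau, theta, n0, width):
    """Largest gap between theta_k and the ODE restarted at theta_n0, over tau_k in [tau_n0, tau_n0 + width]."""
    idx = np.flatnonzero((tau >= tau[n0]) & (tau <= tau[n0] + width))
    ode = 1.0 + (theta[n0] - 1.0) * np.exp(-(tau[idx] - tau[n0]))   # exact solution of (5.25)
    return float(np.max(np.abs(theta[idx] - ode)))

tau, theta = run(rho=0.8)
gaps = [window_gap(tau, theta, n0, width=1.0) for n0 in (10, 100, 1000)]
```

In this example the gap shrinks as the restart index grows, in line with (5.26), and the estimates settle near the root $\theta^* = 1$ of $\bar f$.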


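[Figure 5.4 panels: $\Theta_{\tau_k}$ and $\vartheta_{\tau_k}$ plotted against $\tau_k$, with vertical axis from 0 to 100; the
recoverable panel labels are $\rho = 1.0$ with $\tau_N = 2.8$, and $\rho = 0.9$ with $\tau_N = 5.3$.]


Figure 5.4: ODE approximations for root finding.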


Fig. 5.4 is adapted from Fig. 8.5, illustrating a version of (5.24) in which $f_{n+1}(\theta)$ is random for each $n$, with mean equal to $\bar{f}(\theta)$. The definition of an ODE approximation (5.26) is unchanged in this stochastic approximation setting.

The three plots each compare $\Theta_{\tau_k} = \theta_k$ with $\vartheta_{\tau_k}$ ($\vartheta^n_t$ with $n = 0$), and are distinguished by choice of step-size: $\alpha_n = 1/n^\rho$, with $\rho = 1.0, 0.9, 0.8$. In each case the algorithm (5.24) was run for a common choice of $\{f_n\}$, and common time-horizon $1 \le n \le N$ = 10°. The significant difference observed in these plots is how $\rho$ influences the range of $\tau_k$: the value $N$ = 10° corresponds to $\tau_N < 3$ for $\rho = 1$, while $\tau_N > 10$ for $\rho = 0.8$.

The approximation $\theta_k \approx \vartheta_{\tau_k}$ is unusually tight in this example. The explanation is the large initial condition: the compression of the vertical axis masks volatility of the parameter estimates.

The more aggressive high gain obtained with smaller $\rho$ leads to faster convergence of $\{\vartheta_{\tau_n} : n \ge 0\}$, but in some cases this introduces unacceptable volatility in the parameter estimates $\{\theta_n : n \ge 0\}$. These remarks echo the theory of QSA outlined in Section 4.5.4 and formalized in Thm. 4.24.
5.3 TD-learning and Linear Regression


TD learning refers to methods to approximate a value function for a fixed policy $\phi$. This may be just one step in an approximation of the policy improvement algorithm introduced in Section 3.2.2, which requires estimation of $J^\phi$ to be used in the policy improvement step (3.14).


5.3.1 Fixed-policy temporal difference 


The second glance ahead discussion, contained in Section 3.7, included an informal introduction to approximate policy improvement. This approach requires estimates of the fixed-policy Q-function to obtain the policy update:

$$\phi^{n+1}(x) = \arg\min_u Q_n(x,u), \qquad x \in \mathsf{X}$$

in which $Q_n$ solves or approximates (3.45). Section 4.5.3 contains an example intended to illustrate this approach.

For any policy $\phi$ with associated value function $J^\phi$, the fixed policy Q-function is denoted

$$Q^\phi(x,u) = c(x,u) + J^\phi(F(x,u))$$

In this notation, the fixed point equation (3.45) becomes

$$Q^\phi(x,u) = c(x,u) + Q^\phi(x^+,u^+), \qquad x^+ = F(x,u), \quad u^+ = \phi(x^+) \tag{5.27}$$

For any approximation $\widehat{Q}$, we can observe the error in this fixed point equation as another temporal difference: for any input-state sequence $(\boldsymbol{u}, \boldsymbol{x})$, denote

$$\mathcal{D}_{k+1}(\widehat{Q}) = -\widehat{Q}(x(k),u(k)) + c(x(k),u(k)) + \underline{\widehat{Q}}(x(k+1)) \tag{5.28a}$$

$$\underline{\widehat{Q}}(x) = \widehat{Q}(x,\phi(x)), \qquad x \in \mathsf{X} \tag{5.28b}$$

The temporal difference (5.28a) is zero for all $k$ if we substitute $Q^\phi$ for $\widehat{Q}$.

Algorithms to approximate $Q^\phi$ based on the temporal difference sequence (5.28) are called SARSA. These algorithms are only a minor variation on the TD-learning algorithms designed to estimate $J^\phi$, so we opt for the simpler terminology “TD-learning” throughout the book.

There are two distinct flavors of TD-learning: on policy and off policy. The on-policy versions choose $u(k) = \phi(x(k))$ in the definition (5.28a). The difficulties with on-policy algorithms should be clear following the discussion regarding exploration in Section 2.5.3: if $\phi$ is a good policy, in the sense that $x(k) \to x^e$, $u(k) = \phi(x(k)) \to u^e$ as $k \to \infty$, then for any function $\widehat{Q}$

$$\begin{aligned} \lim_{k\to\infty} \mathcal{D}_{k+1}(\widehat{Q}) &= \lim_{k\to\infty} \bigl\{ -\widehat{Q}(x(k),u(k)) + c(x(k),u(k)) + \widehat{Q}(x(k+1), \phi(x(k+1))) \bigr\} \\ &= c(x^e,u^e) \qquad \text{if } u(k) = \phi(x(k)) \text{ for each } k \end{aligned} \tag{5.29}$$

Consequently, under the convention $c(x^e,u^e) = 0$, the temporal difference error approaches zero for any choice of $\widehat{Q}$.

In this part of the book we focus mainly on off-policy algorithms designed to allow for explo- 
ration. The elegant theory for on-policy algorithms in stochastic control is explored in Part II. 
5.3.2 Least squares and linear regression

Consider the linear parameterization introduced previously in (3.43):

$$Q^\theta(x,u) = \theta^\top\psi(x,u), \qquad \theta\in\mathbb{R}^d \tag{5.30}$$

Given the assumption that $Q^\phi(x^e,u^e) = 0$, it is important to construct the function class with this in mind:

$$\psi_i(x^e,u^e) = 0, \qquad 1 \le i \le d \tag{5.31}$$

From the definition (5.28a) we obtain

$$\mathcal{D}_{k+1}(Q^\theta) = -Q^\theta(x(k),u(k)) + c(x(k),u(k)) + \underline{Q}^\theta(x(k+1))$$

This can be expressed in a form that will inspire a budding statistician. On denoting

$$\gamma_k = c(x(k),u(k)) \tag{5.32a}$$

$$\Upsilon_{k+1} = \psi(x(k),u(k)) - \psi(x(k+1),\phi(x(k+1))) \tag{5.32b}$$

we obtain the representation

$$\gamma_k = \Upsilon_{k+1}^\top\theta + \mathcal{D}_{k+1}(Q^\theta) \tag{5.32c}$$

This is the form of a standard regression problem:

$$\gamma_k = \Upsilon_{k+1}^\top\theta^* + \varepsilon_k$$

where $\{\varepsilon_k = \mathcal{D}_{k+1}(Q^{\theta^*}) : k \ge 0\}$ is regarded as “noise”, and $\theta^*$ is typically defined as the minimum variance parameter: $\theta^* = \arg\min_\theta \Gamma(\theta)$, with

$$\Gamma(\theta) = \mathsf{E}_\varpi\bigl[[\gamma_0 - \Upsilon_1^\top\theta]^2\bigr] = \lim_{N\to\infty}\frac{1}{N}\sum_{k=0}^{N-1}[\gamma_k - \Upsilon_{k+1}^\top\theta]^2 \tag{5.33}$$

This is the mean-square error for the temporal difference sequence: applying (5.32c),

$$\Gamma(\theta) = \lim_{N\to\infty}\frac{1}{N}\sum_{k=0}^{N-1}\bigl[\mathcal{D}_{k+1}(Q^\theta)\bigr]^2$$

Convergence of this limit requires conditions on the input, and further conditions are required so that this loss function is meaningful. In particular, for the on-policy approach in which (5.29) holds, $\Gamma(\theta) = 0$ for every $\theta$! This is why exploration is needed. Exercise 5.2 illustrates design of the probing signal for the special case of LQR.


Least Squares Temporal Difference Learning (LSTD) 


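For a given $d \times d$ matrix $W > 0$, integer $N$, and observed samples $\{u(k), x(k) : 0 \le k \le N\}$, the minimizer is obtained:

$$\theta^{\text{LSTD}}_N = \arg\min_\theta \Gamma_N(\theta), \qquad \Gamma_N(\theta) = \theta^\top W\theta + \sum_{k=0}^{N-1}\bigl[\gamma_k - \Upsilon_{k+1}^\top\theta\bigr]^2 \tag{5.34}$$

This defines the approximation of the Q-function: $Q^{\theta^{\text{LSTD}}_N} = \sum_i \theta^{\text{LSTD}}_N(i)\,\psi_i$.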


Pre-publication draft -- March 25, 2022 




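The objective is a positive definite quadratic, so the solution to (5.34) is obtained on setting the gradient of the objective to zero: $\nabla\Gamma_N(\theta) = 0$ for $\theta = \theta^{\text{LSTD}}_N$.

Proposition 5.3. $\theta^{\text{LSTD}}_N = [N^{-1}W + R_N]^{-1}\bar{b}_N$, with

$$R_N = \frac{1}{N}\sum_{k=0}^{N-1}\Upsilon_{k+1}\Upsilon_{k+1}^\top, \qquad \bar{b}_N = \frac{1}{N}\sum_{k=0}^{N-1}\Upsilon_{k+1}\gamma_k$$

□

The regularizer $\theta^\top W\theta$ is introduced to ensure a unique solution. It is worth investigating the implications if $R_N$ is not invertible.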
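The closed form in Prop. 5.3 is easy to exercise on synthetic data. The following is a minimal sketch under assumptions that do not come from the text: a hypothetical scalar model $x^+ = a x + u$ with $a = 1$, policy $\phi(x) = -Kx$ with $K = 0.5$, cost $c(x,u) = x^2 + u^2$, a quadratic basis, and a sinusoidal probing signal for exploration. In this example $Q^\phi$ lies in the span of the basis, so the regression (5.32c) is noise-free and LSTD should recover its coefficients up to the small bias from the regularizer.

```python
import numpy as np

a, K = 1.0, 0.5      # hypothetical model x+ = a*x + u and policy phi(x) = -K*x

def psi(x, u):
    # Quadratic basis; psi(0, 0) = 0, in line with (5.31) for x^e = u^e = 0.
    return np.array([x * x, x * u, u * u])

def cost(x, u):
    return x * x + u * u

def collect(N, x0=1.0):
    """Off-policy samples: u(k) = phi(x(k)) + probing signal."""
    Ups, gam, x = [], [], x0
    for k in range(N):
        u = -K * x + np.sin(k) + np.sin(np.sqrt(2.0) * k)
        x_next = a * x + u
        Ups.append(psi(x, u) - psi(x_next, -K * x_next))   # Upsilon_{k+1}, see (5.32b)
        gam.append(cost(x, u))                              # gamma_k, see (5.32a)
        x = x_next
    return np.array(Ups), np.array(gam)

def lstd(Ups, gam, W):
    """theta = [W/N + R_N]^{-1} b_N, the formula of Prop. 5.3."""
    N = len(gam)
    R_N = Ups.T @ Ups / N
    b_N = Ups.T @ gam / N
    return np.linalg.solve(W / N + R_N, b_N)

Ups, gam = collect(N=200)
theta_lstd = lstd(Ups, gam, W=1e-8 * np.eye(3))

# Exact Q^phi for this example: J^phi(x) = p x^2, Q^phi(x,u) = x^2 + u^2 + p (a x + u)^2.
p = (1 + K**2) / (1 - (a - K)**2)
theta_star = np.array([1 + p * a**2, 2 * p * a, 1 + p])
```

Removing the probing signal from `collect` gives on-policy data, for which $R_N$ becomes poorly conditioned as the state approaches the origin; that is the situation examined next.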


Proposition 5.4. Suppose that $R_N$ has rank less than $d$. Then, there is a non-zero vector $v \in \mathbb{R}^d$ for which the following hold, for each $0 \le k \le N-1$:

(i) For any $\theta \in \mathbb{R}^d$ and $r \in \mathbb{R}$,

$$\mathcal{D}_{k+1}(Q^{\theta}) = \mathcal{D}_{k+1}(Q^{\theta^r}), \qquad \text{with } \theta^r = \theta + r v$$

(ii) For the on-policy implementation,

$$v^\top\psi(x(0),u(0)) = v^\top\psi(x(k),u(k))$$

Hence the basis falls into the “redundant” category discussed in Section 5.1.5.

Proof. If $R_N$ does not have full rank, it then follows that there is a non-zero vector $v$ satisfying $v^\top R_N v = 0$. By definition:

$$0 = v^\top R_N v = \frac{1}{N}\sum_{k=0}^{N-1}\bigl(v^\top\Upsilon_{k+1}\bigr)^2$$

That is, $v^\top\Upsilon_{k+1} = 0$ for every observed sample, which means

$$0 = v^\top\psi(x(k),u(k)) - v^\top\psi(x(k+1),\phi(x(k+1))), \qquad 0 \le k \le N-1 \tag{5.35}$$

Part (i) then follows from (5.35) and the definition (5.28a): for any scalar $r$, with $\theta^r = \theta + r v$,

$$\begin{aligned} \mathcal{D}_{k+1}(Q^{\theta^r}) &= -Q^{\theta^r}(x(k),u(k)) + c(x(k),u(k)) + Q^{\theta^r}(x(k+1),\phi(x(k+1))) \\ &= c(x(k),u(k)) + [\theta + r v]^\top\bigl[-\psi(x(k),u(k)) + \psi(x(k+1),\phi(x(k+1)))\bigr] \\ &= c(x(k),u(k)) + \theta^\top\bigl[-\psi(x(k),u(k)) + \psi(x(k+1),\phi(x(k+1)))\bigr] = \mathcal{D}_{k+1}(Q^{\theta}) \end{aligned}$$

If $u(k) = \phi(x(k))$ for all $k$ then (5.35) becomes

$$v^\top\psi(x(k),u(k)) = v^\top\psi(x(k+1),u(k+1)), \qquad 0 \le k \le N-1$$

which implies (ii). □


On-policy algorithms are sometimes preferred because of ease of analysis (mainly in the context 
of stochastic control). To ensure sufficient exploration, it may be best to go with the re-start option 
introduced in (2.55): 
Least Squares Temporal Difference Learning (with re-start)

For a given $d \times d$ matrix $W > 0$, integers $N$ and $M$, and observed samples

$$\{u^i(k), x^i(k) : 0 \le k \le N,\ 1 \le i \le M\} \tag{5.36a}$$

with user-defined initial conditions $\{x^i(0) : 1 \le i \le M\}$, and with input $u^i(k) = \check{\phi}(x^i(k), \xi^i(k))$.

The approximation of the Q-function $Q^{\theta^{\text{LSTD}}_N} = \sum_i \theta^{\text{LSTD}}_N(i)\,\psi_i$ is obtained, in which the optimal parameter is defined by the following steps:

(i) Introduce a per-batch loss function $\Gamma^i_N(\theta)$: defined by (5.34) using the $i$th batch, $B^i = \{u^i(k), x^i(k) : 0 \le k \le N\}$.

(ii) Define $\theta^{\text{LSTD}}_N = \arg\min_\theta \Gamma_N(\theta)$, with

$$\Gamma_N(\theta) = \frac{1}{M}\sum_{i=1}^{M}\Gamma^i_N(\theta) \tag{5.36b}$$

This approach does not rule out $u^i(k) = \phi(x^i(k))$ for each $i$ and $k$ (on-policy).

Note that analysis of RL algorithms with restart requires a modification of Assumption (Aξ).
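5.3.3 Recursive LSTD and Zap


LSTD learning is often presented as a recursive algorithm:

Proposition 5.5. LSTD learning in the form (5.34) admits the recursive representation:

$$\theta_{N+1} = \theta_N + G_{N+1}\Upsilon_{N+1}\bigl[\gamma_N - \Upsilon_{N+1}^\top\theta_N\bigr] \tag{5.37a}$$

$$G_{N+1} = G_N - \frac{1}{k_{N+1}}G_N\Upsilon_{N+1}\Upsilon_{N+1}^\top G_N, \qquad N \ge 0 \tag{5.37b}$$

with $k_{N+1} = 1 + \Upsilon_{N+1}^\top G_N\Upsilon_{N+1}$.

Proof. The recursion is an application of Prop. 5.3: letting $G_N = [W + N R_N]^{-1}$, the proposition implies that for $N \ge 0$,

$$G_{N+1}^{-1}\theta_{N+1} = (N+1)\bar{b}_{N+1} \qquad \text{and} \qquad G_{N+1}^{-1} = G_N^{-1} + \Upsilon_{N+1}\Upsilon_{N+1}^\top$$

The recursion (5.37b) follows from the Matrix Inversion Lemma (A.1).

To obtain (5.37a) we apply this recursion for $\bar{b}_N$, also implied by Prop. 5.3:

$$(N+1)\bar{b}_{N+1} = N\bar{b}_N + \Upsilon_{N+1}\gamma_N$$

Consequently,

$$\begin{aligned} G_{N+1}^{-1}\theta_{N+1} &= (N+1)\bar{b}_{N+1} \\ &= N\bar{b}_N + \Upsilon_{N+1}\gamma_N \\ &= G_N^{-1}\theta_N + \Upsilon_{N+1}\gamma_N \\ &= \bigl\{G_{N+1}^{-1} - \Upsilon_{N+1}\Upsilon_{N+1}^\top\bigr\}\theta_N + \Upsilon_{N+1}\gamma_N \end{aligned}$$

Multiplying each side by $G_{N+1}$ and rearranging terms establishes (5.37a). □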
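The equivalence between the recursion and the batch solution can be confirmed numerically. The sketch below uses synthetic regression data (random vectors standing in for $\Upsilon_{k+1}$ and $\gamma_k$, an assumption made only for illustration) and the initialization $G_0 = W^{-1}$, $\theta_0 = 0$, which is what the formula of Prop. 5.3 gives with no samples.

```python
import numpy as np

def recursive_lstd(Ups, gam, W):
    """Run (5.37a)-(5.37b) over the rows of Ups, starting from G_0 = W^{-1}, theta_0 = 0."""
    d = Ups.shape[1]
    G = np.linalg.inv(W)
    theta = np.zeros(d)
    for ups, g in zip(Ups, gam):
        kappa = 1.0 + ups @ G @ ups                       # k_{N+1}
        G = G - np.outer(G @ ups, ups @ G) / kappa        # (5.37b)
        theta = theta + (G @ ups) * (g - ups @ theta)     # (5.37a), with the updated gain
    return theta

rng = np.random.default_rng(0)
Ups = rng.standard_normal((50, 3))     # synthetic stand-in for Upsilon_1, ..., Upsilon_50
gam = rng.standard_normal(50)          # synthetic stand-in for gamma_0, ..., gamma_49
W = np.eye(3)

theta_rec = recursive_lstd(Ups, gam, W)
# Batch solution: [W + sum Upsilon Upsilon']^{-1} sum Upsilon gamma.
theta_batch = np.linalg.solve(W + Ups.T @ Ups, Ups.T @ gam)
```

The recursion avoids a matrix inversion at each step, at the cost of storing and updating the $d \times d$ matrix $G_N$.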
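The algorithm (5.37) can be represented as a QSA recursion with matrix gain:

$$\theta_{N+1} = \theta_N + \alpha_{N+1}\widehat{R}_{N+1}^{-1}\Upsilon_{N+1}\bigl[\gamma_N - \Upsilon_{N+1}^\top\theta_N\bigr]$$

in which $\alpha_N = 1/N$ and

$$\widehat{R}_N = \frac{1}{N}\Bigl[W + \sum_{k=1}^{N}\Upsilon_k\Upsilon_k^\top\Bigr]$$

Back to Zap. We might have arrived at this algorithm through the ODE method, in which the recursive algorithm is designed to approximate a matrix-gain ODE

$$\tfrac{d}{dt}\vartheta_t = M\bar{f}(\vartheta_t)$$

Given our objective to minimize the loss function (5.33), it is natural to take $\bar{f}(\theta) = -\tfrac{1}{2}\nabla\Gamma(\theta)$ and $M$ positive definite. The recursion (5.37a) is an approximation of this ODE using $M = \mathsf{E}_\varpi[\Upsilon_k\Upsilon_k^\top]^{-1}$. We can also write

$$M^{-1} = \mathsf{E}_\varpi[\Upsilon_k\Upsilon_k^\top] = \tfrac{1}{2}\partial_\theta^2\,\mathsf{E}_\varpi\bigl[(\gamma_{k-1} - \Upsilon_k^\top\theta)^2\bigr] = \tfrac{1}{2}\partial_\theta^2\Gamma(\theta) = -\partial_\theta\bar{f}(\theta)$$

Hence LSTD is a single time-scale approximation of Zap QSA.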


5.4 Projected Bellman Equations and TD Algorithms 


LSTD learning suffers from two computational challenges: 


(i) What do you do if $d = 10^6$? Practitioners in machine learning often face high dimensional
optimization problems, and claim that dimensions of one million are no longer a concern. 
This success story is attributed to advances in optimization theory, computer engineering, 
and computing power. 


(ii) How can LSTD be extended to nonlinear function approximation, such as when each $\theta_i$ is a weight in a neural network?


In the RL research community it is typical to modify the objective in order to reduce computational 
complexity. A favored approach is to create algorithms that obtain solutions to a projected dynamic 
programming equation. While there is little supporting theory, the algorithms inspired by this 
viewpoint have been highly successful with neural network function approximation. 

The motivation behind these algorithms requires a bit more background on function approximation, described here using the notation of Section 5.1. We begin with an abstraction: find a function $h^*$ that solves a fixed point equation:

$$h^* = T(h^*) \tag{5.38}$$

The specifics of the domain and range of $h^*$, and the meaning of the mapping $T$, depend on the problem we would like to solve. The DP equation (3.5) is one example, with $h^* = J^*$. If solving (5.38) is intractable, we might seek an approximation.

We choose a function class $\mathcal{H}$ and a mapping $P_{\mathcal{H}} : \mathcal{H} \to \mathcal{H}$. That is, $P_{\mathcal{H}}(h) \in \mathcal{H}$ for any $h \in \mathcal{H}$. More conditions on this mapping will be imposed below. We then introduce an approximation of (5.38):

$$\widehat{h} = \widehat{T}(\widehat{h}) := P_{\mathcal{H}}\{T(\widehat{h})\} \tag{5.39}$$
In some cases this approach fails or is too complex, so we consider an alternative: given a second function class $\mathcal{G}$, find a function $\widehat{h} \in \mathcal{H}$ solving

$$0 = P_{\mathcal{G}}\{\widehat{h} - T(\widehat{h})\} \tag{5.40}$$

This is a generalization of (5.39):

Proposition 5.6. Suppose that the following hold:

(i) $\mathcal{H} = \mathcal{G}$

(ii) $\mathcal{H}$ is a linear function class: $a_1 h_1 + a_2 h_2 \in \mathcal{H}$ whenever $h_1, h_2 \in \mathcal{H}$ and $a_1, a_2 \in \mathbb{R}$.

(iii) The mapping $P_{\mathcal{H}}$ is linear: for $h_1, h_2 \in \mathcal{H}$ and $a_1, a_2 \in \mathbb{R}$,

$$P_{\mathcal{H}}(a_1 h_1 + a_2 h_2) = a_1 P_{\mathcal{H}}(h_1) + a_2 P_{\mathcal{H}}(h_2)$$

Then the solutions to (5.39) and (5.40) coincide. □

When we put these ideas to practice in control, the mappings $P_{\mathcal{H}}$ and $P_{\mathcal{G}}$ are defined to be projections (formally defined below). $\widehat{T}$ will define the projected Bellman operator, and (5.40) is called the projected Bellman equation.


5.4.1 Galerkin relaxations and projection 


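We begin with assumptions on $\mathcal{G}$ and the mapping $P_{\mathcal{G}}$.

It is assumed that each $g \in \mathcal{G}$ is a function $g : \mathsf{Z} \to \mathbb{R}$, with $\mathsf{Z}$ the larger state space used in Assumption (Aξ). The function class $\mathcal{G}$ is also assumed to be linear: $a_1 g_1 + a_2 g_2 \in \mathcal{G}$ whenever $g_1, g_2 \in \mathcal{G}$ and $a_1, a_2 \in \mathbb{R}$. To define what is meant by projection requires geometry: the expectation introduced in (Aξ) is used to define an inner product and norm on functions $h_1, h_2 : \mathsf{Z} \to \mathbb{R}$:

$$\langle h_1, h_2\rangle_\varpi = \mathsf{E}_\varpi[h_1(\Phi)h_2(\Phi)], \qquad \|h_1\|_\varpi = \sqrt{\mathsf{E}_\varpi[(h_1(\Phi))^2]} = \sqrt{\langle h_1, h_1\rangle_\varpi}$$

The function class $L_2(\varpi)$ is defined to be all functions $h$ for which $\|h\|_\varpi$ is finite. For any $h \in L_2(\varpi)$, the projection onto $\mathcal{G}$ is defined as

$$\widehat{h} = P_{\mathcal{G}}(h) = \arg\min_g\{\|g - h\|_\varpi : g \in \mathcal{G}\}$$

The optimizer $\widehat{h} \in \mathcal{G}$ satisfies the orthogonality principle

$$\langle h - \widehat{h}, g\rangle_\varpi = 0, \qquad g \in \mathcal{G} \tag{5.41}$$

We henceforth assume that $\mathcal{G}$ has finite dimension: we choose $d$ functions $\{\zeta_i : 1 \le i \le d\}$, stack these together to define a function $\zeta : \mathsf{Z} \to \mathbb{R}^d$, and then define $\mathcal{G} = \{g = \theta^\top\zeta : \theta \in \mathbb{R}^d\}$. We denote $\zeta(k) = \zeta(\Phi(k))$, and call this the sequence of eligibility vectors, since they will play a role in a Galerkin relaxation (first introduced in (5.12)).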
Proposition 5.7. Suppose that $\zeta_i \in L_2(\varpi)$ for each $i$, and that these functions are linearly independent in $L_2(\varpi)$. That is, $\|\theta^\top\zeta\|_\varpi = 0$ implies that $\theta = 0$.

For each $h \in L_2(\varpi)$ the projection exists, is unique, and given by $\widehat{h} = \theta^{*\top}\zeta$ with

$$\theta^* = [R^\zeta]^{-1}\bar{b}^\zeta \tag{5.42}$$

where $\bar{b}^\zeta \in \mathbb{R}^d$ and the $d \times d$ matrix $R^\zeta$ are defined by

$$\bar{b}^\zeta_i = \langle\zeta_i, h\rangle_\varpi, \qquad R^\zeta(i,j) = \langle\zeta_i, \zeta_j\rangle_\varpi, \qquad 1 \le i,j \le d \tag{5.43}$$

Proof. The orthogonality principle (5.41) tells us that for each $i$,

$$\langle h - \widehat{h}, \zeta_i\rangle_\varpi = 0$$

Combining this identity with the representation $\widehat{h} = \theta^{*\top}\zeta$ completes the proof. □
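The projection formula can be illustrated with an empirical inner product. In the sketch below the expectation $\mathsf{E}_\varpi$ is replaced by an average over samples, and the samples, the polynomial functions $\zeta_i$, and the function $h = \sin$ are all hypothetical choices made for illustration. The final line checks the orthogonality principle (5.41) for each $\zeta_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.uniform(-1.0, 1.0, size=2000)      # samples standing in for the ergodic average

def zeta(z):
    # Hypothetical functions zeta_1, zeta_2, zeta_3: 1, z, z^2.
    return np.stack([np.ones_like(z), z, z**2], axis=-1)

h = np.sin(Phi)                               # function to be projected (evaluated on samples)
Z = zeta(Phi)                                 # shape (n, d)

R = Z.T @ Z / len(Phi)                        # R(i, j) = <zeta_i, zeta_j>
b = Z.T @ h / len(Phi)                        # b_i = <zeta_i, h>
theta_star = np.linalg.solve(R, b)            # (5.42)

h_hat = Z @ theta_star                        # projection evaluated on the samples
residual_ip = Z.T @ (h - h_hat) / len(Phi)    # <h - h_hat, zeta_i> for each i
```

Up to rounding error every entry of `residual_ip` is zero, which is the sample-path version of (5.41).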


Prop. 5.7 is the motivation for Galerkin approaches to root finding, generalizing the sample path definition (5.12):

Proposition 5.8. Equation (5.40) holds if and only if

$$0 = \langle\zeta_i, \widehat{h} - T(\widehat{h})\rangle_\varpi, \qquad 1 \le i \le d \tag{5.44}$$

This is by definition the Galerkin relaxation of (5.38) in this $L_2$ setting. □


5.4.2 TD(λ)-learning

The fixed-point equation (5.27) is of the form (5.38): define for any function $h : \mathsf{X} \times \mathsf{U} \to \mathbb{R}$,

$$T(h)\big|_{(x,u)} = c(x,u) + h(x^+,u^+), \qquad x^+ = F(x,u), \quad u^+ = \phi(x^+)$$

so that $Q^\phi = T(Q^\phi)$.

Galerkin relaxations lead to the oldest and most celebrated RL algorithms. Consider specification of $\mathcal{H}$ as a finite dimensional function class $\{h = \theta^\top\psi : \theta \in \mathbb{R}^d\}$, where $\psi_i : \mathsf{X} \times \mathsf{U} \to \mathbb{R}$ for each $i$. We arrive at the projected Bellman equation by applying the approximation (5.39) to this problem, in its equivalent form (5.44): for each $i$,

$$0 = \mathsf{E}_\varpi\Bigl[\zeta_i(k)\bigl\{h(x(k),u(k)) - \bigl[c(x(k),u(k)) + h(x(k+1),\phi(x(k+1)))\bigr]\bigr\}\Bigr]$$

where we have used the definition of the inner product, along with the notation $\zeta(k) = \zeta(\Phi(k))$. The solution of this root finding problem defines $Q^{\theta^*} \in \mathcal{H}$.

Recalling the definition (5.29), the projected Bellman equation is equivalently expressed

$$0 = \mathsf{E}_\varpi\bigl[\zeta(k)\mathcal{D}_{k+1}(Q^\theta)\bigr]\Big|_{\theta=\theta^*} \tag{5.45}$$

Given $N$ observations, an approximation is obtained via

$$0 = \frac{1}{N}\sum_{k=0}^{N-1}\zeta(k)\mathcal{D}_{k+1}(Q^\theta) \tag{5.46}$$

which is precisely of the form (5.12).
We can now define the meaning of “λ” in TD(λ) learning, which depends entirely on the choice of $\mathcal{G}$. To keep notation more compact denote

$$\psi_{(k)} = \psi(x(k),u(k)), \qquad c_k = c(x(k),u(k)), \qquad k \ge 0 \tag{5.47}$$

and when possible use $\zeta_k$ instead of $\zeta(k)$ for the eligibility vector, and we denote $\underline{Q}^\theta(x) = Q^\theta(x,\phi(x))$.


TD(λ) Learning


For a given $\lambda \in [0,1]$, non-negative step-size sequence $\{\alpha_n\}$, initial conditions $\theta_0$, $\zeta_0$, and observed samples $\{u(k), x(k) : 0 \le k \le N\}$, the sequence of estimates is defined by the coupled equations:

$$\theta_{n+1} = \theta_n + \alpha_{n+1}\mathcal{D}_{n+1}\zeta_n \tag{5.48a}$$

$$\mathcal{D}_{n+1} = -Q^{\theta_n}(x(n),u(n)) + c_n + \underline{Q}^{\theta_n}(x(n+1)) \tag{5.48b}$$

$$\zeta_{n+1} = \lambda\zeta_n + \psi_{(n+1)} \tag{5.48c}$$

This defines the approximation of the Q-function $Q^{\theta_N} = \sum_i \theta_N(i)\,\psi_i$.
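The three coupled equations translate directly into a loop. The sketch below is an illustration under assumptions not found in the text: the same hypothetical scalar model $x^+ = a x + u$, policy $\phi(x) = -Kx$, quadratic cost and basis as a toy example, the initialization $\zeta_0 = \psi_{(0)}$, and a small constant step-size. Because $Q^\phi$ lies in the span of this basis, the temporal difference vanishes at the corresponding parameter, so the recursion started there should not move; this gives a simple consistency check.

```python
import numpy as np

a, K = 1.0, 0.5            # hypothetical model x+ = a*x + u and policy phi(x) = -K*x

def phi(x):
    return -K * x

def psi(x, u):
    return np.array([x * x, x * u, u * u])

def cost(x, u):
    return x * x + u * u

def td_lambda(xs, us, theta0, lam, step):
    """One pass of (5.48) over samples x(0..N), u(0..N) with Q^theta = theta' psi."""
    theta = np.array(theta0, dtype=float)
    zeta = psi(xs[0], us[0])                               # assumed initialization zeta_0
    for n in range(len(xs) - 1):
        q_now = theta @ psi(xs[n], us[n])
        q_next = theta @ psi(xs[n + 1], phi(xs[n + 1]))    # underline-Q at x(n+1)
        D = -q_now + cost(xs[n], us[n]) + q_next           # (5.48b)
        theta = theta + step(n + 1) * D * zeta             # (5.48a)
        zeta = lam * zeta + psi(xs[n + 1], us[n + 1])      # (5.48c)
    return theta

# Off-policy data: the input is the policy plus a probing signal.
xs, us = [1.0], []
for k in range(51):
    us.append(phi(xs[k]) + np.sin(k))
    xs.append(a * xs[k] + us[k])
xs = xs[:51]

p = (1 + K**2) / (1 - (a - K)**2)
theta_star = np.array([1 + p * a**2, 2 * p * a, 1 + p])    # coefficients of Q^phi
theta_out = td_lambda(xs, us, theta_star, lam=0.5, step=lambda n: 1e-3)
```

Starting from another initial condition, the behavior depends on the step-size and on the data; no claim about convergence is made for this toy setup.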


For the purposes of analysis we require an extension of the state process:

$$\Phi(k) = (x(k), u(k), \xi(k), \zeta(k))^\top$$

so that $\zeta(k)$ is a linear function of $\Phi(k)$.

Denote $\bar{f}_\lambda(\theta) = \mathsf{E}_\varpi[\zeta(k)\mathcal{D}_{k+1}(Q^\theta)]$. TD(λ) is an approximation of the ODE

$$\tfrac{d}{dt}\vartheta = \bar{f}_\lambda(\vartheta)$$

The right hand side is linear: $\bar{f}_\lambda(\theta) = A(\theta - \theta^*)$, in which

$$A = \mathsf{E}_\varpi\bigl[\zeta(k)\{-\psi_{(k)} + \psi(x(k+1),\phi(x(k+1)))\}^\top\bigr] \tag{5.49}$$

Convergence of TD(λ) learning is not guaranteed in the off-policy setting, even with $\lambda = 0$. A famous example is described below Fig. 5.5.

What about neural networks? If $\mathcal{H}$ is not a linear function class then we lose much of the convergence theory, but the algorithm can be salvaged:


TD(λ) Learning (with nonlinear function approximation)


For a given $\lambda \in [0,1]$, non-negative step-size sequence $\{\alpha_n\}$, initial conditions $\theta_0$, $\zeta_0$, and observed samples $\{u(k), x(k) : 0 \le k \le N\}$, the sequence of estimates is defined by the coupled equations:

$$\theta_{n+1} = \theta_n + \alpha_{n+1}\mathcal{D}_{n+1}\zeta_n \tag{5.50a}$$

$$\zeta_{n+1} = \lambda\zeta_n + \zeta^0_{n+1} \tag{5.50b}$$

$$\zeta^0_n = \nabla_\theta Q^\theta(x(n),u(n))\big|_{\theta=\theta_n} \tag{5.50c}$$

and with $\mathcal{D}_{n+1}$ defined as in (5.48b).

This is a generalization of (5.48) since $\zeta^0_n = \psi_{(n)}$ if $\mathcal{H}$ is a $d$-dimensional linear function class.


Unstable dynamics with perfect exploration. Fig. 5.5 shows Baird’s counterexample, which provides an example of instability when using TD(λ). There are seven states and no control, with cost identically zero. From any initial condition, the state remains at $x(k) = 7$ for all $k \ge 1$. Hence $Q^*(x,u) = h^*(x) = 0$ for each $x$. Note that here $\theta^i$ denotes the $i$th index of the vector $\theta$.

The example violates two conventions in this chapter:

(i) $\psi(x^e) = (0,\dots,0,1,2)^\top \ne 0$ with $x^e = 7$, so that (5.31) is violated.

(ii) There are only seven states yet the basis is 8-dimensional. This implies that the sample correlation matrix (5.19) is never full rank, for any value of $N$.

The degenerate correlation matrix shouldn’t be such a concern, since $h^{\theta^*} = h^*$ and also $\mathcal{D}_{n+1}(h^{\theta^*}) = 0$ for all $n$ with $\theta^* = 0$. Rank degeneracy implies that $\theta^*$ is not unique.


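[Diagram: states 1 through 6 each transition to state 7, which transitions to itself. Value function approximation: $h^\theta(k) = 2\theta^k + \theta^8$ for $1 \le k \le 6$, and $h^\theta(7) = \theta^7 + 2\theta^8$.]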


Figure 5.5: Baird’s star problem. 


The example was introduced in the discounted cost setting, with discount factor $\gamma < 1$. This is maintained here, and since there is no control we use value function notation to define the temporal difference:

$$\mathcal{D}_{n+1}(h^\theta) = -h^\theta(x(n)) + \gamma h^\theta(x(n+1)) = -h^\theta(x(n)) + \gamma h^\theta(7) \tag{5.51}$$

where the second equality holds because of the trivial dynamics. The definition of the eligibility vector is modified slightly in the discounted setting:

$$\zeta_{n+1} = \lambda\gamma\zeta_n + \psi(x(n+1))$$

So called “perfect exploration” is based on a version of re-start, in which episodes allow only a single transition, and each of seven initial conditions is sampled uniformly. A periodic implementation is described as follows: for $n = 0, 1, \dots, 6$ choose $x(n) = n + 1$ and obtain

$$\mathcal{D}_{n+1} = -h^{\theta_n}(n+1) + \gamma h^{\theta_n}(7)$$

where $\{\theta_n\}$ are updated using TD(λ): $\theta_{n+1} = \theta_n + \alpha_{n+1}\mathcal{D}_{n+1}\zeta_n$. This procedure is repeated, so that $x(n) - 1 = n$ modulo 7 for all $n \ge 1$.

The algorithm diverges for $\gamma < 1$ sufficiently large and some initial conditions (under our standing assumption that $\sum\alpha_n = \infty$). To see this requires a closer look at the temporal difference:

$$\mathcal{D}_{n+1} = \begin{cases} -[\theta^8_n + 2\theta^k_n] + \gamma[2\theta^8_n + \theta^7_n] & x(n) = k \le 6 \\ -[2\theta^8_n + \theta^7_n] + \gamma[2\theta^8_n + \theta^7_n] & x(n) = 7 \end{cases}$$

The source of instability is revealed by looking at the evolution of the last entry of the parameter estimate:

$$\theta^8_{n+1} = \theta^8_n + \begin{cases} \alpha_{n+1}\{(2\gamma - 1)\theta^8_n + \gamma\theta^7_n - 2\theta^k_n\}\zeta^8_n & x(n) = k \le 6 \\ \alpha_{n+1}\{-(1-\gamma)[2\theta^8_n + \theta^7_n]\}\zeta^8_n & x(n) = 7 \end{cases}$$

Suppose that $\gamma > \tfrac{1}{2}$ and $\theta^8_0 > 0$ is large relative to the other parameters. Whenever $x(n) = k \le 6$, the estimate $\theta^8_n$ tends to increase, because of the positive coefficient $(2\gamma - 1)$. If $x(n) = 7$ then the coefficient of $\theta^8_n$ becomes negative, but remember $x(n) = 7$ occurs only 1/7 of the time.


[Plots: eigenvalues of $A$ in the complex plane, and the parameter estimates $\theta^i_n$ as a function of $n$, for each of the two experiments.]


Figure 5.6: Baird’s star problem: parameter estimates and eigenvalues of $A$ with $\gamma = 0.9$. (a) $\lambda = 0$ (b) $\lambda = 0.75$.


In this example, TD(λ) can be expressed as the linear recursion

$$\theta_{n+1} = \theta_n + \alpha_{n+1}A_{n+1}\theta_n, \qquad A_{n+1} = \zeta_n\{-\psi(x(n)) + \gamma\psi(7)\}^\top$$

The matrix (5.49) becomes

$$A = \mathsf{E}_\varpi[A_{n+1}] = \mathsf{E}_\varpi\bigl[\zeta_n\{-\psi(x(n)) + \gamma\psi(7)\}^\top\bigr]$$

Based on the recursion for $\theta^8_n$ above it may not be surprising that $A$ is not Hurwitz for all values of $\gamma$. Fig. 5.6 shows results from two experiments with $\gamma = 0.9$, two values of $\lambda$, and fixed step-size $\alpha_n = \alpha_0$.

The parameter estimates diverge in these two experiments, and the behavior is as predicted from the eigenvalues of $A$. With $\lambda = 0$ the eigenvalues in the right half plane are complex, and in this case the estimates oscillate to infinity.
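The eigenvalue claim for $\lambda = 0$ can be reproduced with a few lines of linear algebra. The sketch below is an illustration, not the experiment behind Fig. 5.6: it assumes the features implied by the temporal difference expressions above ($h^\theta(k) = 2\theta^k + \theta^8$ for $k \le 6$ and $h^\theta(7) = \theta^7 + 2\theta^8$), takes $\zeta_n = \psi(x(n))$ as holds when $\lambda = 0$, and averages over the seven states with equal weight.

```python
import numpy as np

gamma = 0.9

# Row s of psi holds the feature vector of state s+1 (eight parameters).
psi = np.zeros((7, 8))
for k in range(6):
    psi[k, k] = 2.0      # coefficient of theta^k in h(k)
    psi[k, 7] = 1.0      # coefficient of theta^8 in h(k)
psi[6, 6] = 1.0          # h(7) = theta^7 + 2 theta^8
psi[6, 7] = 2.0

# A = average over states of psi(s) {-psi(s) + gamma psi(7)}', the lambda = 0 case.
A = sum(np.outer(psi[s], -psi[s] + gamma * psi[6]) for s in range(7)) / 7.0
eigs = np.linalg.eigvals(A)
worst = eigs[np.argmax(eigs.real)]     # eigenvalue with the largest real part
```

With these assumptions `worst` has positive real part and non-zero imaginary part, which is consistent with the oscillatory divergence described for $\lambda = 0$.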


In the remainder of this section we turn to approximation of Q*: the Q-function associated 
with the optimal control problem. 
5.4.3 Projected Bellman operator and Q-learning 


The Q-function for the total cost optimal control problem solves the fixed point equation (3.7d), copied here for convenience:

$$Q^*(x,u) = c(x,u) + \underline{Q}^*(F(x,u))$$

with $\underline{Q}(x) = \min_u Q(x,u)$ for any $Q$. For a parameterized family of approximations $\{Q^\theta : \theta \in \mathbb{R}^d\}$, recall that for each $\theta$ we define a policy via (3.44):

$$\phi^\theta(x) = \arg\min_u Q^\theta(x,u), \qquad x \in \mathsf{X}$$

We obtain an algorithm for approximation via “pattern matching” with (5.48). The algorithm with linear function approximation is presented here:
Q(λ) Learning


For a given $\lambda \in [0,1]$, non-negative step-size sequence $\{\alpha_n\}$, initial conditions $\theta_0$, $\zeta_0$, and observed samples $\{u(k), x(k) : 0 \le k \le N\}$, the sequence of estimates is defined by the coupled equations:

$$\theta_{n+1} = \theta_n + \alpha_{n+1}\mathcal{D}_{n+1}\zeta_n \tag{5.52a}$$

$$\zeta_{n+1} = \lambda\zeta_n + \psi_{(n+1)} \tag{5.52b}$$

$$\mathcal{D}_{n+1} = -Q^{\theta_n}(x(n),u(n)) + c_n + Q^{\theta_n}(x(n+1),\phi^{\theta_n}(x(n+1))) \tag{5.52c}$$

with $\psi_{(n+1)} = \psi(x(n+1),u(n+1))$, $c_n = c(x(n),u(n))$.
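A single update of these equations is small enough to verify by hand. The sketch below assumes a finite input set and a two-dimensional basis, both hypothetical and chosen only so that the arithmetic is easy to follow; the function name `q_lambda_step` is also invented for this example. The minimum over the input set plays the role of $Q^{\theta_n}(x(n+1),\phi^{\theta_n}(x(n+1)))$.

```python
import numpy as np

U = [-1.0, 1.0]                        # hypothetical finite input set

def psi(x, u):                         # hypothetical basis with d = 2
    return np.array([x * x, x * u])

def q_lambda_step(theta, zeta, x, u, c, x_next, u_next, alpha, lam):
    """One update of (5.52) for the linear parameterization Q^theta = theta' psi."""
    q_min_next = min(theta @ psi(x_next, v) for v in U)   # min over u of Q^theta(x+, u)
    D = -(theta @ psi(x, u)) + c + q_min_next             # (5.52c)
    theta_new = theta + alpha * D * zeta                  # (5.52a)
    zeta_new = lam * zeta + psi(x_next, u_next)           # (5.52b)
    return theta_new, zeta_new, D

theta0 = np.array([1.0, 0.5])
zeta0 = psi(1.0, 1.0)
theta1, zeta1, D1 = q_lambda_step(theta0, zeta0, x=1.0, u=1.0, c=1.0,
                                  x_next=2.0, u_next=-1.0, alpha=0.1, lam=0.5)
```

Here $Q^{\theta_0}(1,1) = 1.5$ and $\min_u Q^{\theta_0}(2,u) = 3$, so the temporal difference is $-1.5 + 1 + 3 = 2.5$ and the parameter moves by $0.25$ in each coordinate.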


The algorithm has significant similarities as well as differences with TD-learning:

▶ The major change is (5.52c): the current policy estimate $\phi^{\theta_n}$ is used, rather than the fixed policy $\phi$ in TD(λ). Note that in (5.52c) we can substitute

$$Q^{\theta_n}(x(n+1),\phi^{\theta_n}(x(n+1))) = \underline{Q}^{\theta_n}(x(n+1)) = \min_u Q^{\theta_n}(x(n+1),u)$$

▶ QSA theory predicts that a limit $\theta^*$ for Q(λ)-learning will solve $\bar{f}(\theta^*) = 0$, with

$$\bar{f}(\theta) = \mathsf{E}_\varpi[f_{n+1}(\theta)], \qquad f_{n+1}(\theta) = \mathcal{D}_{n+1}(Q^\theta)\zeta_n \tag{5.53}$$

This appears the same as TD(λ) until you recognize the different definition for $\mathcal{D}_{n+1}(Q^\theta)$ obtained from (5.52c):

$$\mathcal{D}_{n+1}(Q^\theta) = -Q^\theta(x(n),u(n)) + c_n + \min_u Q^\theta(x(n+1),u)$$

▶ For the special case $\lambda = 0$, we can apply Prop. 5.6 to conclude that $Q^{\theta^*}$ solves the projected Bellman equation

$$Q^{\theta^*} = P_{\mathcal{H}}\{T(Q^{\theta^*})\}$$

in which the Bellman operator is redefined: $T(Q^\theta)\big|_{(x,u)} = c(x,u) + \underline{Q}^\theta(F(x,u))$.

Sadly, these observations bring us to a dead-end. An ODE analysis of Q(λ)-learning requires that we look at global asymptotic stability of the ODE with vector field $\bar{f}$ defined in (5.53). As a first step, we must find conditions under which the root finding problem admits a solution! Unfortunately, very little is known about existence, even in the case $\lambda = 0$. And, if an equilibrium does exist, stability theory is nearly absent.


5.4.4 GQ-learning 


If we are concerned that $\bar{f}(\theta^*) = 0$ does not admit a solution, then we might turn to the next best option: for a given $d \times d$ matrix $M > 0$, solve

$$\min_\theta \Gamma(\theta) = \min_\theta \tfrac{1}{2}\bar{f}(\theta)^\top M\bar{f}(\theta) \tag{5.54}$$

We can then apply the ODE method to devise an algorithm. One approach is gradient descent:

$$\tfrac{d}{dt}\vartheta_t = -[\partial_\theta\bar{f}(\vartheta_t)]^\top M\bar{f}(\vartheta_t) \tag{5.55}$$

The GQ-learning algorithm of [235] can be regarded as a discrete-time translation of this ODE, using $M = \mathsf{E}[\zeta_n\zeta_n^\top]^{-1}$. Matrix inversion is avoided through a two time-scale algorithm: an ODE approximation of $M\bar{f}(\vartheta_t)$ is obtained using the solution to the “high gain” ODE

$$\tfrac{d}{dt}\omega_t = b_t[\bar{f}(\vartheta_t) - R\omega_t]$$

where $R = M^{-1} = \mathsf{E}[\zeta_n\zeta_n^\top]$. Provided $\{b_t\}$ is chosen very large and $\{\vartheta_t\}$ is bounded, it is not difficult to establish that the approximation $\omega_t \approx M\bar{f}(\vartheta_t)$ will hold after a transient period.
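The role of the high gain ODE is to compute $M\bar{f}$ without inverting a matrix. The sketch below freezes $\vartheta_t$, so that $\bar{f}(\vartheta_t)$ is a constant vector, and applies Euler steps to the $\omega$ dynamics; the matrix $R$, the vector standing in for $\bar{f}$, and the gain are hypothetical numbers chosen for illustration.

```python
import numpy as np

R = np.array([[2.0, 0.5],
              [0.5, 1.0]])             # hypothetical R = E[zeta zeta'], positive definite
f_bar = np.array([1.0, -1.0])          # hypothetical value of f_bar at a frozen parameter

omega = np.zeros(2)
beta = 0.1                             # Euler step-size times gain
for _ in range(500):
    omega = omega + beta * (f_bar - R @ omega)

target = np.linalg.solve(R, f_bar)     # M f_bar with M = R^{-1}
```

The iterates approach `target` geometrically because the eigenvalues of $I - \beta R$ lie strictly inside the unit circle for this choice of $\beta$.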

These ODEs are motivation for the two time-scale GQ algorithm, presented here for linear 
function approximation: 


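GQ(λ) Learning

With the same starting point as Q(λ) learning, and an additional initialization $\omega_0 \in \mathbb{R}^d$:

$$\theta_{n+1} = \theta_n - \alpha_{n+1}A_{n+1}^\top\omega_n \tag{5.56a}$$

$$\omega_{n+1} = \omega_n + \beta_{n+1}\{f_{n+1}(\theta_n) - \zeta_n\zeta_n^\top\omega_n\} \tag{5.56b}$$

$$\zeta_{n+1} = \lambda\zeta_n + \psi_{(n+1)}$$

$$\mathcal{D}_{n+1} = -Q^{\theta_n}(x(n),u(n)) + c_n + \underline{Q}^{\theta_n}(x(n+1))$$

$$f_{n+1}(\theta_n) = \mathcal{D}_{n+1}\zeta_n, \qquad A_{n+1} = \partial_\theta f_{n+1}(\theta_n) = \zeta_n\{-\psi_{(n)} + \underline{\psi}_{(n+1)}\}^\top \tag{5.56c}$$

with $\psi_{(n+1)} = \psi(x(n+1),u(n+1))$, and $\underline{\psi}_{(n+1)} = \psi(x(n+1),\phi^{\theta_n}(x(n+1)))$.


As always, the output of the algorithm defines the final approximation $Q^{\theta_N}$. The ODE approximation is successful provided the second step-size sequence is relatively large:

$$\lim_{n\to\infty}\frac{\beta_n}{\alpha_n} = \infty \tag{5.57}$$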


There are several challenges and questions: 


▶ We may have difficulty obtaining a global minimum of (5.54) because the objective $\Gamma$ is not convex.

▶ Suppose that in fact $\bar{f}(\theta^*) = 0$ does have a solution. Challenges remain with any approximation of the ODE (5.55), beyond lack of convexity of the objective function. Nesterov discusses this approach to root finding in his monograph [273, Section 4.4.1]. He warns that it can lead to numerical instability: “...if our system of equations is linear, then such a transformation squares the condition number of the problem”.¹⁰ He goes on to warn that it can lead to a “squaring the number of iterations” to obtain the desired error bound. To see this, consider the second-order approximation of the loss function at $\theta^*$:

$$\Gamma(\theta) \approx \Gamma(\theta^*) + \tfrac{1}{2}(\theta - \theta^*)^\top[A^{*\top}MA^*](\theta - \theta^*)$$

which uses $\bar{f}(\theta^*) = 0$, with $A^* = \partial_\theta\bar{f}(\theta^*)$. The appearance of $A^{*\top}MA^*$ is the “squaring” that Nesterov warns about. If $A^*$ has a large condition number, then we may make things much worse by squaring. The numerical challenge can be addressed with an alternative choice for $M$, provided this can be done without introducing additional complexity.

▶ Last but not least, it is not obvious that minimizing (5.54) is a worthwhile goal. Further research is needed to explain why Q(λ) and GQ algorithms are often successful in practice, and to predict when these algorithms might fail.

¹⁰ More on the curse of condition number can be found in Section 8.5.1, along with definitions.


5.4.5 Batch Methods and DQN 


The Deep Q Network (DQN) algorithm was designed for neural network function approximation; the term “deep” refers to a large number of hidden layers. The basic algorithm is summarized here, without imposing any particular form for $Q^\theta$.

One component of this approach is to abandon the purely recursive form of the preceding RL algorithms. In a batch RL algorithm, the time-horizon $N$ is broken into $B$ batches of more reasonable size, defined by the sequence of intermediate times $T_0 = 0 < T_1 < T_2 < \cdots < T_{B-1} < T_B = N$. The potential benefits are more obvious when we come to RL design using kernels for function approximation, or for stochastic control systems.


DQN 


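With $\theta_0 \in \mathbb{R}^d$ given, along with a sequence of positive scalars $\{\alpha_n\}$, define recursively,

$$\theta_{n+1} = \arg\min_\theta\Bigl\{\Gamma^\varepsilon_n(\theta) + \frac{1}{\alpha_{n+1}}\|\theta - \theta_n\|^2\Bigr\}, \qquad 0 \le n \le B-1 \tag{5.58a}$$

where for each $n$, with $r_n = T_{n+1} - T_n$,

$$\Gamma^\varepsilon_n(\theta) = \frac{1}{r_n}\sum_{k=T_n}^{T_{n+1}-1}\bigl[-Q^\theta(x(k),u(k)) + c_k + \underline{Q}^{\theta_n}(x(k+1))\bigr]^2 \tag{5.58b}$$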


The elegance and simplicity of DQN is clear when $Q^\theta$ is defined via linear function approximation, so that (5.58a) is the unconstrained minimum of a quadratic. The following summarizes obvious yet suggestive properties of the solution to (5.58a) for both linear and nonlinear function approximation:

Proposition 5.9. Suppose that $\{Q^\theta(x,u) : \theta \in \mathbb{R}^d\}$ is continuously differentiable in $\theta$ for each $x, u$. Then

(i) The solution to (5.58a) solves the fixed point equation

$$\theta_{n+1} = \theta_n + \alpha_{n+1}\frac{1}{r_n}\sum_{k=T_n}^{T_{n+1}-1}\bigl[-Q^\theta(x(k),u(k)) + \gamma_n(k)\bigr]\nabla_\theta Q^\theta(x(k),u(k))\Big|_{\theta=\theta_{n+1}} \tag{5.59}$$

with $\gamma_n(k) = c_k + \underline{Q}^{\theta_n}(x(k+1))$.

(ii) If the parameterization is linear, so that $\nabla_\theta Q^\theta(x(k),u(k)) = \psi_{(k)}$, then

$$\theta_{n+1} = \theta_n + \alpha_{n+1}\{A_n\theta_{n+1} - b_n\} \tag{5.60a}$$

$$\text{with} \quad A_n = -\frac{1}{r_n}\sum_{k=T_n}^{T_{n+1}-1}\psi_{(k)}\psi_{(k)}^\top \tag{5.60b}$$

$$b_n = -\frac{1}{r_n}\sum_{k=T_n}^{T_{n+1}-1}\gamma_n(k)\psi_{(k)} \tag{5.60c}$$

The linear case is particularly simple since we can solve (5.60a) by rearranging terms and inverting:

$$\theta_{n+1} = [I - \alpha_{n+1}A_n]^{-1}\{\theta_n - \alpha_{n+1}b_n\}$$

The approximation $[I - \alpha_{n+1}A_n]^{-1} \approx I + \alpha_{n+1}A_n$ holds if $\alpha_{n+1}$ is sufficiently small, and from this we obtain

$$\theta_{n+1} \approx [I + \alpha_{n+1}A_n]\{\theta_n - \alpha_{n+1}b_n\} \approx \theta_n + \alpha_{n+1}\{A_n\theta_n - b_n\}$$

For nonlinear function approximation it is not clear why the optimization problem should be solved at each iteration. Under the assumptions of Prop. 5.9 we have $\|\theta_{n+1} - \theta_n\| \le K\alpha_{n+1}$ for some fixed $K < \infty$, whenever the parameter sequence $\{\theta_n\}$ is bounded. Consequently,

$$\theta_{n+1} = \theta_n + \alpha_{n+1}\frac{1}{r_n}\sum_{k=T_n}^{T_{n+1}-1}\bigl\{-Q^{\theta_n}(x(k),u(k)) + \gamma_n(k) + \mathcal{E}_{n+1}\bigr\}\nabla_\theta Q^{\theta_n}(x(k),u(k))$$

where $\|\mathcal{E}_{n+1}\| \le O(\alpha_{n+1})$. This motivates the batch QSA approximation of DQN:


Batch Q(0) Learning


With $\theta_0 \in \mathbb{R}^d$ given, along with a sequence of positive scalars $\{\alpha_n\}$, define recursively,

$$\theta_{n+1} = \theta_n + \alpha_{n+1}\frac{1}{r_n}\sum_{k=T_n}^{T_{n+1}-1}\mathcal{D}_{k+1}(\theta_n)\nabla_\theta Q^{\theta_n}(x(k),u(k)) \tag{5.61}$$

$$\mathcal{D}_{k+1}(\theta_n) = -Q^{\theta_n}(x(k),u(k)) + c_k + \underline{Q}^{\theta_n}(x(k+1))$$


While DQN is easy to implement, it does not resolve the issues surrounding Q(λ)-learning:

Proposition 5.10. Consider the DQN algorithm with possibly nonlinear function approximation. Assume that $Q^\theta$ is continuously differentiable, and its gradient $\nabla Q^\theta(x,u)$ is globally Lipschitz continuous, with Lipschitz constant independent of $(x,u)$. Suppose that $B = \infty$, the non-negative step-size sequence satisfies

$$\sum\alpha_n = \infty, \qquad \sum\alpha_n^2 < \infty$$

and suppose that the sequence $\{\theta_n\}$ defined by the DQN algorithm is convergent to some $\theta_\infty \in \mathbb{R}^d$. Then

(i) $\bar{f}(\theta_\infty) = 0$, with $\bar{f}$ defined via (5.53) using $\zeta_n = \nabla_\theta Q^\theta(x(n),u(n))\big|_{\theta=\theta_\infty}$

(ii) The algorithm admits the ODE approximation $\tfrac{d}{dt}\vartheta_t = \bar{f}(\vartheta_t)$

□

These conclusions should raise a warning, since we do not know if $\bar{f}$ defined in (5.53) has a root. If we manage to establish the existence of a solution to $\bar{f}(\theta_\infty) = 0$, we do not know if the ODE is stable, or if $\theta_\infty$ has desirable properties.

On the other hand, DQN is used every day with success, so it should not be ignored. 


5.5 Convex Q-learning 


We have surveyed three classes of algorithms within the “TD taxonomy”: 


(i) Approximate PIA using LSTD or TD(λ). We can be assured success under two conditions: (a) the function class is linear, and (b) the function class is complete in the sense that we can be assured that $Q^{\theta_n} = Q^{\phi^n}$ for each $n$.

(ii) GQ learning to obtain the minimal mean-square Bellman error. We are assured success if $Q^*$ lies in our function class, and the objective function satisfies conditions aligned with gradient descent (such as the PL condition introduced in Section 4.4.2).

(iii) Galerkin relaxations of the DP equation are obtained using Q(λ) learning, DQN, or Batch Q(0) learning. Here theory is almost non-existent.

The RL algorithms surveyed in this section are all motivated by the “DPLP” (3.36). We begin with a direct approach based on a joint parameterization: in the notation of Section 5.1, $\mathcal{H} = \{h^\theta = (J^\theta, Q^\theta) : \theta \in \mathbb{R}^d\}$. The value $\theta_i$ might represent the $i$th weight in a neural network function approximation architecture, but to justify the adjective convex we require a linearly parameterized family:

$$J^\theta(x) = \theta^\top\psi^J(x), \qquad Q^\theta(x,u) = \theta^\top\psi(x,u)$$

The function class is normalized with $J^\theta(x^e) = 0$ for each $\theta$. For the linear approximation architecture this requires $\psi^J_i(x^e) = 0$ for each $1 \le i \le d$; for a neural network architecture, this normalization is imposed through definition of the output of the network. Convex Q-learning algorithms based on a reproducing kernel Hilbert space (RKHS) are described in Section 5.5.2.

Consider the translation of (3.36) based on a parameterized family:

$$\begin{aligned} \max_\theta\ & \langle\mu, Q^\theta\rangle \\ \text{s.t.}\ & Q^\theta(x,u) \le c(x,u) + J^\theta(F(x,u)) \\ & Q^\theta(x,u) \ge J^\theta(x), \qquad x \in \mathsf{X},\ u \in \mathsf{U}(x) \end{aligned} \tag{5.62}$$

where $\mu$ is now a weighting function on $\mathsf{X} \times \mathsf{U}$.

We can if we wish strengthen the first constraint to equality: $Q^\theta(x,u) = c(x,u) + J^\theta(F(x,u))$. This is reasonable if we have a good model: first create a parameterized family $\{J^\theta\}$, and then define $Q^\theta(x,u) = c(x,u) + J^\theta(F(x,u))$ for each $\theta$, $x$, $u$.

However, if we don’t have a highly accurate model, it is then better to relax the equality constraint, which is why (5.62) is our preferred starting point.
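Another inequality that might be removed from (5.62) is $Q^\theta \ge J^\theta$, since the optimal solution
for the DPLP satisfies $J^\star = \underline{Q}^\star$ (recall the notation $\underline{Q}(x) = \min_u Q(x,u)$). We can remove $J^\theta$
from (5.62) by imposing this constraint:

$$\begin{aligned}
\max_\theta \quad & \langle \mu, Q^\theta \rangle \\
\text{s.t.} \quad & Q^\theta(x,u) \le c(x,u) + \underline{Q}^\theta(F(x,u)), \qquad x \in \mathsf{X},\ u \in \mathsf{U}(x)
\end{aligned} \tag{5.63}$$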


or even

$$\max_\theta \ \langle \nu, \underline{Q}^\theta \rangle \quad \text{subject to the inequality constraints in (5.63),}$$

with $\nu$ a weighting function on $\mathsf{X}$. With either objective function, this remains a convex program
provided the function class is linear in $\theta$. And through design we must ensure that $\underline{Q}^\theta(x^e) = 0$ for
each $\theta$. This might be achieved by imposing $\psi_i(x^e, u) = 0$ for all $i$ and $u$, and recognizing that it
is reasonable to assume that $\mathsf{U}(x^e) = \{u^e\}$ (once we are at the equilibrium with zero cost, there is
no reason to leave).

The algorithms in this section are based on approximations of (5.63). Several options are 
surveyed in the following. 


5.5.1 Convex Q learning with finite dimensional function class 


We begin with the linear function class: 


$$Q^\theta(x,u) = \theta^T \psi(x,u) \tag{5.64}$$


While this is required for convergence theory, any of the algorithms to come can be applied using 
a nonlinear function approximation architecture. 
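An approximation of the inequality constraint in (5.63) is given by $\Gamma^\varepsilon(\theta) \le 0$, where

$$\Gamma^\varepsilon(\theta) = \frac{1}{N} \sum_{k=0}^{N-1} [D^\circ_{k+1}(\theta)]_- \tag{5.65a}$$

$$D^\circ_{k+1}(\theta) = -Q^\theta(x(k),u(k)) + c_k + \underline{Q}^\theta(x(k+1)) \tag{5.65b}$$

and $[z]_- = \max(0,-z)$ for any $z \in \mathbb{R}$, and $c_k = c(x(k),u(k))$. Subject to (5.64) the function $\Gamma^\varepsilon$ is
convex, so that $\{\theta : \Gamma^\varepsilon(\theta) \le 0\}$ is a convex subset of $\mathbb{R}^d$.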
This motivates the first algorithm: 


Convex Q-Learning 


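Choose a pmf $\mu$ on $\mathsf{X} \times \mathsf{U}$, a convex regularizer $\mathcal{R}_N(\theta)$, tolerance Tol $\ge 0$, and solve

$$\theta^* = \arg\min_\theta \ -\langle \mu, Q^\theta \rangle + \mathcal{R}_N(\theta) \tag{5.66a}$$

$$\text{s.t.} \quad \frac{1}{N} \sum_{k=0}^{N-1} [D^\circ_{k+1}(\theta)]_- \le \text{Tol} \tag{5.66b}$$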
The choice of regularizer might be based on computational efficiency. The introduction of the
tolerance “Tol” is also in anticipation of computational challenges to be discussed shortly. 
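The empirical constraint (5.65) is easy to evaluate from data when the input space is finite. The Python sketch below is one way to do this; the helper names, the three-dimensional quadratic basis, and the two-sample data set are hypothetical choices made only for illustration, and the minimum over $u$ is computed by enumeration.

```python
def q_value(theta, psi, x, u):
    """Linear parameterization (5.64): Q(x, u) = theta . psi(x, u)."""
    return sum(t * p for t, p in zip(theta, psi(x, u)))


def q_min(theta, psi, x, inputs):
    """Minimum of Q(x, u) over a finite input set (enumeration is an assumption)."""
    return min(q_value(theta, psi, x, u) for u in inputs)


def temporal_difference(theta, psi, inputs, sample):
    """D(theta) in (5.65b) for one sample (x, u, cost, x_next)."""
    x, u, cost, x_next = sample
    return -q_value(theta, psi, x, u) + cost + q_min(theta, psi, x_next, inputs)


def constraint_gap(theta, psi, inputs, data):
    """Empirical constraint (5.65a): average of max(0, -D) over the data."""
    total = sum(max(0.0, -temporal_difference(theta, psi, inputs, s)) for s in data)
    return total / len(data)


# Hypothetical example: x(k+1) = x(k) + u(k) with cost x^2 + u^2.
inputs = (-1, 0, 1)
psi = lambda x, u: (x * x, u * u, x * u)      # illustrative basis, d = 3
data = [(2, -1, 5, 1), (1, -1, 2, 0)]         # samples (x, u, cost, x_next)

gap_zero = constraint_gap((0.0, 0.0, 0.0), psi, inputs, data)   # D = cost >= 0
gap_big = constraint_gap((10.0, 0.0, 0.0), psi, inputs, data)   # Q is too large
gap_mid = constraint_gap((5.0, 0.0, 0.0), psi, inputs, data)    # midpoint of the two
```

With $\theta = 0$ every temporal difference equals the cost, so the constraint holds. The value at the midpoint is no larger than the average of the two endpoint values, as convexity requires.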

The next algorithm is inspired by the batch architecture of DQN: choose intermediate times
$T_0 = 0 < T_1 < T_2 < \cdots < T_{B-1} < T_B = N$, and a sequence of regularizers: $\mathcal{R}_n(Q, \theta)$ is a convex
functional of $Q, \theta$, that may depend on $\theta_n$. Examples are provided below.


Batch Convex Q-Learning #1 
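With $\theta_0 \in \mathbb{R}^d$ given, define recursively

$$\theta_{n+1} = \arg\min_\theta \ \{ -\langle \mu, Q^\theta \rangle + \mathcal{R}_n(Q^\theta, \theta) \} \tag{5.67a}$$

$$\text{s.t.} \quad \Gamma^\varepsilon_n(\theta) \le \text{Tol} \tag{5.67b}$$

where for $0 \le n \le B-1$, with $r_n = T_{n+1} - T_n$,

$$\Gamma^\varepsilon_n(\theta) = \frac{1}{r_n} \sum_{k=T_n}^{T_{n+1}-1} [D^\circ_{k+1}(\theta)]_- \tag{5.67c}$$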


It is expected that $\mathcal{R}_n$ will be designed to avoid discarding previous data, such as

$$\mathcal{R}_n(Q^\theta, \theta) = \mathcal{R}^0_n(\theta) + \frac{1}{2} \frac{1}{\alpha_{n+1}} \|\theta - \theta_n\|^2 \tag{5.68}$$

in which $\mathcal{R}^0_n$ is convex and $\{\alpha_n\}$ are positive scalars.
The next variant is inspired by a primal-dual algorithm. Recall that $[z]_+ = \max(0, z)$ for $z \in \mathbb{R}$.


Batch Convex Q-Learning #2 


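With $\theta_0 \in \mathbb{R}^d$ and $\lambda_0 \ge 0$ given, and a step-size sequence $\{\alpha_n\}$, define recursively,

$$\theta_{n+1} = \arg\min_\theta \ \{ -\langle \mu, Q^\theta \rangle + \lambda_n [\Gamma^\varepsilon_n(\theta) - \text{Tol}] + \mathcal{R}_n(Q^\theta, \theta) \} \tag{5.69a}$$

$$\lambda_{n+1} = [\lambda_n + \alpha_{n+1} (\Gamma^\varepsilon_n(\theta_{n+1}) - \text{Tol})]_+ \tag{5.69b}$$

The introduction of Tol $> 0$ (strictly positive) in (5.69) is crucial: $\Gamma^\varepsilon_n(\theta_{n+1}) \ge 0$ for any $n$ by
definition, so $\{\lambda_n\}$ is a non-decreasing sequence if Tol $= 0$.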

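This formulation is best suited to approximation as a QSA recursion, as in the batch Q(0)-learning
algorithm (5.61). Consider the choice of regularizer (5.68), with $\mathcal{R}^0_n$ convex for each $n$.
The first order condition for optimality in (5.69a) results in the fixed point equation:

$$0 = \nabla_\theta \{ -\langle \mu, Q^\theta \rangle + \lambda_n [\Gamma^\varepsilon_n(\theta) - \text{Tol}] + \mathcal{R}^0_n(\theta) \} \Big|_{\theta = \theta_{n+1}} + \frac{1}{\alpha_{n+1}} (\theta_{n+1} - \theta_n)$$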


This motivates a primal-dual algorithm: 
Batch Convex Q-Learning #3


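With $\theta_0 \in \mathbb{R}^d$ and $\lambda_0 \ge 0$ given, define recursively,

$$\theta_{n+1} = \theta_n - \alpha_{n+1} \nabla_\theta \{ -\langle \mu, Q^\theta \rangle + \lambda_n [\Gamma^\varepsilon_n(\theta) - \text{Tol}] + \mathcal{R}^0_n(\theta) \} \Big|_{\theta = \theta_n} \tag{5.70a}$$

$$\lambda_{n+1} = [\lambda_n + \beta_{n+1} (\Gamma^\varepsilon_n(\theta_{n+1}) - \text{Tol})]_+ \tag{5.70b}$$

where $\{\alpha_n, \beta_n\}$ satisfy (5.57).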
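The structure of the primal-dual recursion (5.70) can be seen in a scalar toy problem. The sketch below is only an illustration under stated assumptions: the regularizer is taken to be zero, the "constraint" is the hypothetical function $\max(0, \theta - 1)$ standing in for the empirical loss, and constant step-sizes are used to keep the arithmetic transparent (they are not claimed to satisfy (5.57)).

```python
def primal_dual(theta, lam, grad_objective, gamma, grad_gamma, alphas, betas, tol):
    """Iterate (5.70) for a scalar parameter, with zero regularizer (an assumption)."""
    history = []
    for alpha, beta in zip(alphas, betas):
        # (5.70a): gradient step on the Lagrangian, evaluated at the current theta
        grad = grad_objective(theta) + lam * grad_gamma(theta)
        theta = theta - alpha * grad
        # (5.70b): projected ascent step for the multiplier, using the new theta
        lam = max(0.0, lam + beta * (gamma(theta) - tol))
        history.append((theta, lam))
    return history


# Hypothetical toy problem: maximize theta subject to max(0, theta - 1) <= Tol.
grad_objective = lambda th: -1.0                      # gradient of -theta
gamma = lambda th: max(0.0, th - 1.0)                 # stand-in for the constraint function
grad_gamma = lambda th: 1.0 if th > 1.0 else 0.0      # one choice of subgradient

history = primal_dual(0.0, 0.0, grad_objective, gamma, grad_gamma,
                      alphas=[0.5] * 3, betas=[1.0] * 3, tol=0.1)
```

The multiplier remains at zero while the constraint is satisfied with slack, and begins to grow only once $\theta$ overshoots the feasible region, which then pushes $\theta$ back.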


Any of these algorithms can be improved through attention to the concepts in Section 4.5. In 
particular, convergence might be accelerated using PJR averaging. 


Numerical instability with fast sampling. Real life control problems typically involve a 
system operating in continuous time. Care must be taken with sampling, because the temporal 
difference may not be very informative. 


This challenge arises in every algorithm based on temporal differences, but is most pronounced 
with convex Q-learning. It is best explained through an example. 


Example 5.5.1. Convex Q for Mountain Car 


The system equations (2.58) for Mountain Car represent an Euler approximation for the ODE
(2.56), with sampling interval $\Delta = t_{k+1} - t_k = 10^{-3}$, so that $x(k) \approx x_{t_k}$, with $t_k = k \times 10^{-3}$, and
where $\{x_t\}$ is the solution to the ODE. In applying convex Q-learning with the basis (5.11), it is
observed that $D^\circ_{n+1} = c_n$ for many values of $n$, where

$$D^\circ_{n+1} = -Q^\theta(x(n), u(n)) + c_n + \underline{Q}^\theta(x(n+1))$$

The desired constraint $D^\circ_{n+1} \ge 0$ is vacuous in this case. This is purely an artifact of fast sampling,
resulting in very small values $\|x(n+1) - x(n)\|$.

We can choose to increase the sampling interval, or we can adopt state dependent sampling.
Suppose that we are given data $\{x(k)\}$ that is obtained from sampling at a very quick rate. One
approach to sub-sampling is through binning, $\mathsf{X} = \bigcup_i B_i$ (a disjoint union), with sampling
times $\{\tau_k\}$ chosen so that adjacent sampled states lie in distinct bins: $\text{Bin}(x(\tau_{k+1})) \ne \text{Bin}(x(\tau_k))$ for all $k$.

Here is one successful approach: choose an upper limit $\bar{\tau}$, take $\tau_0 = 0$, and for $k \ge 0$,

$$\tau_{k+1} = \min\{\tau_k + \bar{\tau},\ \tau^\circ_{k+1}\}, \qquad \tau^\circ_{k+1} = \min\{ j \ge \tau_k + 1 : \text{Bin}(x(j)) \ne \text{Bin}(x(\tau_k)) \}$$
To apply any of the algorithms in this chapter we assume the input takes a constant value on the
interval $[\tau_k, \tau_{k+1})$ for each $k$, and introduce the cumulative cost,

$$C_{\tau_k} = \sum_{j=\tau_k}^{\tau_{k+1}-1} c(x(j), u(\tau_k))$$

For Mountain Car this becomes $C_{\tau_n} = \tau_{n+1} - \tau_n$ if the goal is not reached before time $\tau_{n+1}$. The
temporal difference sequence is then redefined:

$$D^\circ_{n+1} = -Q^\theta(x(\tau_n), u(\tau_n)) + C_{\tau_n} + \underline{Q}^\theta(x(\tau_{n+1}))$$

An alternative is to apply one of the specialized algorithms for continuous time data described in
Section 5.6.
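The state dependent sampling rule of Example 5.5.1 takes only a few lines to implement. The following Python sketch assumes a scalar state with bins defined by a sorted list of boundaries; the function names, the data, and the choice of upper limit are hypothetical. A new sampling time is declared as soon as the bin changes, or once the upper limit on the gap is reached, whichever comes first.

```python
def bin_index(x, edges):
    """Bin of a scalar state: number of boundaries in `edges` at or below x."""
    return sum(1 for e in edges if x >= e)


def sampling_times(xs, edges, max_gap):
    """tau_0 = 0 and tau_{k+1} = min(tau_k + max_gap, first time the bin changes)."""
    taus = [0]
    while True:
        tau = taus[-1]
        current = bin_index(xs[tau], edges)
        j = tau + 1
        while j < len(xs) and j < tau + max_gap and bin_index(xs[j], edges) == current:
            j += 1
        if j >= len(xs):          # ran out of data before the next sampling time
            return taus
        taus.append(j)


def cumulative_costs(xs, us, taus, cost):
    """Cost summed over [tau_k, tau_{k+1}), with the input held at u(tau_k)."""
    return [sum(cost(xs[j], us[t0]) for j in range(t0, t1))
            for t0, t1 in zip(taus, taus[1:])]


# Hypothetical fast-sampled data with bins (-inf, 1), [1, 2), [2, inf).
xs = [0.0, 0.1, 0.2, 1.1, 1.2, 1.3, 1.4, 1.5, 2.2]
us = [1.0] * len(xs)
taus = sampling_times(xs, edges=[1.0, 2.0], max_gap=4)
costs = cumulative_costs(xs, us, taus, cost=lambda x, u: 1.0)   # unit cost per step
```

In this example the first and third sampling times are triggered by a change of bin, and the second by the upper limit. With unit cost, each cumulative cost is the gap between consecutive sampling times, as stated for Mountain Car.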


5.5.2 BCQL and kernel methods 


Batch methods are included in this book because they are currently popular with practitioners,
and also in anticipation of kernel methods. In this setting we no longer have a fixed
$d$-dimensional basis. Rather, due to Thm. 5.1 (the Representer Theorem), an effective basis emerges
from the observed samples in which the dimension of $\theta$ is equal to the number of observations $N$.
Even in simple examples, the value of $N$ for a reliable estimate may be larger than one million.

Suppose that $\mathcal{H}$ is defined based on a RKHS. BCQL #2 is easily adapted to this setting with
a change in notation. In particular, we redefine the loss function (5.67c):

$$\Gamma^\varepsilon_n(Q) = \frac{1}{r_n} \sum_{k=T_n}^{T_{n+1}-1} [D^\circ_{k+1}(Q)]_-$$

$$D^\circ_{k+1}(Q) = -Q(x(k), u(k)) + c(x(k), u(k)) + \underline{Q}(x(k+1)), \qquad Q \in \mathcal{H}$$


Kernel Batch Convex Q-Learning 


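With $\theta_0 \in \mathbb{R}^d$ and $\lambda_0 \ge 0$ given, define recursively,

$$Q^{n+1} = \arg\min_{Q \in \mathcal{H}} \ \{ -\langle \mu, Q \rangle + \lambda_n [\Gamma^\varepsilon_n(Q) - \text{Tol}] + \mathcal{R}_n(Q) \} \tag{5.71a}$$

$$\lambda_{n+1} = [\lambda_n + \alpha_{n+1} (\Gamma^\varepsilon_n(Q^{n+1}) - \text{Tol})]_+ \tag{5.71b}$$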


A candidate regularizer similar to (5.68) is the quadratic

$$\mathcal{R}_n(Q) = \frac{1}{2} \frac{1}{\alpha_{n+1}} \|Q - Q^n\|^2_{\mathcal{H}}$$


However, this choice presents a challenge: to apply the Representer Theorem, we must make a
change of variables $h = Q - Q^n$, and our BCQL algorithm will produce the optimizer $h^{n+1} \in \mathcal{H}$.
The Representer Theorem tells us that the function $h^{n+1}(\cdot,\cdot)$ is a linear combination of the
functions $\{\mathsf{k}((x(k), u(k)), (\cdot,\cdot))\}$ where $k$ ranges over the $n$th batch (of size $r_n$). The Q-function
approximation is the sum $Q^{n+1} = Q^n + h^{n+1}$, and so by induction

$$Q^{n+1} = Q^0 + \sum_{i=1}^{n+1} h^i$$

This estimate is entirely too complex: each $h^i$ depends on $r_i$ observations.
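A regularizer that avoids this complexity is defined by the sum of two quadratics

$$\mathcal{R}_n(Q) = \frac{1}{2} \frac{1}{\alpha_{n+1}} \|Q - Q^n\|^2_n + \frac{\delta}{2} \|Q\|^2_{\mathcal{H}} \tag{5.72}$$
where $Q^n \in \mathcal{H}$ is the estimate at stage $n$, and

$$\|Q - Q^n\|^2_n = \frac{1}{r_n} \sum_{k=T_n}^{T_{n+1}-1} \bigl( Q(x(k), u(k)) - Q^n(x(k), u(k)) \bigr)^2$$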
With this choice of regularizer, the Representer Theorem tells us that for each $n$ the optimizer
takes the following form, for some $\theta^{n*} \in \mathbb{R}^{r_n}$,

$$Q^n(z) = \sum_{i=1}^{r_n} \theta^{n*}_i \, \mathsf{k}(z_i, z)$$
where $\{z_i = (x_i, u_i)\}$ are the state-input pairs observed on the time interval $\{T_{n-1} \le k < T_n\}$.


5.6 Q-Learning in Continuous Time* 


Recall from Section 3.8 the HJB equation $0 = \min_u \{c(x,u) + \nabla J^\star(x) \cdot f(x,u)\}$. The term in brackets
doesn’t lead to a useful definition of the Q-function. Instead, we fix a scalar $\sigma > 0$ and denote

$$H^\star(x,u) \overset{\text{def}}{=} \{c(x,u) + \nabla J^\star(x) \cdot f(x,u)\} + \sigma J^\star(x)$$


A similar construction was introduced in Section 4.5.3 using $\sigma = 1$. The addition of a function of
$x$ doesn’t change the minimizer, so that whenever the conditions of Thm. 3.12 hold,

$$\phi^\star(x) = \arg\min_u H^\star(x,u)$$

The identity $\underline{H}^\star(x) = \min_u H^\star(x,u) = \sigma J^\star(x)$ follows from the HJB equation, which on
substituting into the definition of $H^\star$ gives

$$H^\star(x,u) = c(x,u) + \sigma^{-1} \nabla \underline{H}^\star(x) \cdot f(x,u) + \underline{H}^\star(x) \tag{5.73}$$


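The component involving the model can be eliminated using the chain rule: $\frac{d}{dt} \underline{H}^\star(x_t) = \nabla \underline{H}^\star(x_t) \cdot \frac{d}{dt} x_t$,
and then substituting the dynamics $\frac{d}{dt} x_t = f(x_t, u_t)$ gives for any input-state trajectory

$$\frac{d}{dt} \underline{H}^\star(x_t) = \nabla \underline{H}^\star(x_t) \cdot f(x_t, u_t)$$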


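We can follow the same steps as in discrete time: substituting the derivative formula into (5.73)
gives the sample path representation

$$H^\star(x_t, u_t) = c(x_t, u_t) + \sigma^{-1} \frac{d}{dt} \underline{H}^\star(x_t) + \underline{H}^\star(x_t)$$

This is valid for any input and resulting state satisfying $\frac{d}{dt} x_t = f(x_t, u_t)$ for all $t$.
Rearranging terms, $\underline{H}^\star(x_t)$ is expressed as the output of a first-order, stable linear system:

$$\frac{d}{dt} \underline{H}^\star(x_t) = -\sigma \underline{H}^\star(x_t) + \sigma U_t \tag{5.74}$$
with “input” $U_t = H^\star(x_t, u_t) - c(x_t, u_t)$, whose solution is

$$\underline{H}^\star(x_t) = e^{-\sigma t} \underline{H}^\star(x_0) + \sigma \int_0^t e^{-\sigma(t-\tau)} [H^\star(x_\tau, u_\tau) - c(x_\tau, u_\tau)] \, d\tau \tag{5.75}$$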
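It is convenient to use compact notation for the ‘smoothed’ processes on the right hand side:

$$\mathcal{H}^\star_t = \sigma \int_0^t e^{-\sigma(t-\tau)} H^\star(x_\tau, u_\tau) \, d\tau, \qquad \mathcal{C}_t = \sigma \int_0^t e^{-\sigma(t-\tau)} c(x_\tau, u_\tau) \, d\tau \tag{5.76}$$
Each satisfies a first order differential equation:

$$\frac{d}{dt} \mathcal{H}^\star_t = -\sigma [\mathcal{H}^\star_t - H^\star(x_t, u_t)], \qquad \frac{d}{dt} \mathcal{C}_t = -\sigma [\mathcal{C}_t - c(x_t, u_t)]$$

Given a parameterized family of approximations $\{H^\theta : \theta \in \mathbb{R}^d\}$ we define $\underline{H}^\theta$ and $\mathcal{H}^\theta$ as above, and
then duplicate any of the algorithms proposed for discrete time.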

Two batch algorithms are introduced in the following. Either can be replaced with a QSA ODE 
similar to (5.61). The two algorithms are distinguished by the goal. In the first and simpler case 
we are considering the standard total cost problem, and assume we are observing $(x_t, u_t)$ for a
long period of time. In the second algorithm we are using re-start, which is well-motivated in path 
finding problems as in the Mountain Car example. 

If $t$ is large, then it is reasonable to neglect the term $e^{-\sigma t} \underline{H}^\theta(x_0)$ in (5.75), and define the
temporal difference as follows:

$$D_t(\theta) = \underline{H}^\theta(x_t) - [\mathcal{H}^\theta_t - \mathcal{C}_t] \tag{5.77}$$
This is a concave function of $\theta$ if the parameterization is linear.
We arrive at a generalization of (5.66) to choose $\theta^*$:

Convex Q-Learning


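Choose a pmf $\mu$ on $\mathsf{X} \times \mathsf{U}$, a time horizon $[T_0, T]$, a convex regularizer $\mathcal{R}(\theta)$, tolerance Tol $\ge 0$, and
solve

$$\theta^* = \arg\min_\theta \ -\langle \mu, H^\theta \rangle + \mathcal{R}(\theta) \tag{5.78a}$$

$$\text{s.t.} \quad \int_{T_0}^{T} [D_t(\theta)]_- \, dt \le \text{Tol} \tag{5.78b}$$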


Consider next a batch setting based on independent trials (as in Mountain Car, in which we
reach the goal at time $T_n$, and then re-initialize to a state $x_{T_n}$). For simplicity we state only a
translation of DQN, in which one appearance of the parameter is fixed at the previous value $\theta_n$.
Define for $t \ge T_n$,

$$D^n_t(\theta) = \underline{H}^{\theta_n}(x_t) - e^{-\sigma(t - T_n)} \underline{H}^{\theta}(x_{T_n}) - \sigma \int_{T_n}^{t} e^{-\sigma(t-\tau)} [H^\theta(x_\tau, u_\tau) - c(x_\tau, u_\tau)] \, d\tau \tag{5.79}$$

This is used to obtain a loss function $\Gamma^\varepsilon_n$ analogous to (5.58b).


DQN 
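With $\theta_0 \in \mathbb{R}^d$ given, along with a sequence of positive scalars $\{\alpha_n\}$, define recursively,

$$\theta_{n+1} = \arg\min_\theta \Bigl\{ \Gamma^\varepsilon_n(\theta) + \frac{1}{\alpha_{n+1}} \|\theta - \theta_n\|^2 \Bigr\} \tag{5.80a}$$
where for each $n$:

$$\Gamma^\varepsilon_n(\theta) = \frac{1}{r_n} \int_{T_n}^{T_{n+1}} [D^n_t(\theta)]^2 \, dt, \qquad \text{with } r_n = T_{n+1} - T_n. \tag{5.80b}$$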
These algorithms are especially practical for a linear parameterization $H^\theta = \theta^T \psi$, since for
example $\mathcal{H}^\theta_t = \theta^T \Psi_t$, where $\Psi_t$ is obtained by filtering:

$$\Psi_t = \sigma \int_0^t e^{-\sigma(t-\tau)} \psi(x_\tau, u_\tau) \, d\tau$$
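In practice the filtered basis and the filtered cost in (5.76) are computed from sampled data. The sketch below is one possible discretization, not a prescribed one: it assumes the signal is held constant between samples, in which case the first order differential equation can be integrated exactly over each interval. The function name and the numerical values are hypothetical.

```python
import math


def smooth(samples, sigma, dt):
    """Filter d/dt y = -sigma * (y - v) with y(0) = 0, for vector samples v held
    constant over intervals of length dt (exact for piecewise-constant inputs)."""
    decay = math.exp(-sigma * dt)
    y = [0.0] * len(samples[0])
    path = [list(y)]
    for v in samples:
        y = [decay * yi + (1.0 - decay) * vi for yi, vi in zip(y, v)]
        path.append(list(y))
    return path   # path[k] is the filtered signal at time k * dt


# Check against the closed form for a constant unit input: y_t = 1 - exp(-sigma t).
sigma, dt, steps = 2.0, 1e-3, 1000
path = smooth([[1.0]] * steps, sigma, dt)
closed_form = 1.0 - math.exp(-sigma * steps * dt)
```

The same routine applies to the vector of basis functions evaluated along the trajectory, or to the scalar cost. With a linear parameterization the filter is run once, independently of $\theta$.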


5.7 Duality* 


The ideas introduced here are based on an entirely different view of basic dynamic programming 
from Section 3.1. The main conclusion is an interesting dual for the DPLP (3.36). 

This section might be regarded as a preview of techniques used in stochastic control and RL 
for Markovian models. 

To keep notation simple, let’s assume that $\mathsf{X}$ and $\mathsf{U}$ are finite. There are no special input
constraints: $\mathsf{U}(x) = \mathsf{U}$ for each $x \ne x^e$, but we require $u(k) = u^e$ when $x(k) = x^e$, so that
$\mathsf{U}(x^e) = \{u^e\}$. The equilibrium satisfies by definition $x^e = F(x^e, u^e)$. The cost is non-negative,
with $c(x^e, u^e) = 0$.

The variables in the LP consist of non-negative functions on $\mathsf{X} \times \mathsf{U}$. Justification requires a
different perspective on optimal control. For any input sequence $\boldsymbol{u}$ we define what is known as the
(conditional) occupancy pmf:

$$\varpi(x, u \mid x_0) = \sum_{k=0}^{\infty} \mathbb{1}\{x(k) = x,\ u(k) = u\}, \qquad (x,u) \in \mathsf{X} \times \mathsf{U},\ x \ne x^e$$

where $\boldsymbol{x}$ is the state process obtained with initial condition $x_0$ and input $\boldsymbol{u}$. For notational
convenience we set $\varpi(x^e, u \mid x_0) = 0$ for any $u$. A bit of accounting gives

$$\sum_{x,u} c(x,u) \varpi(x, u \mid x_0) = J(x_0) = \sum_{k=0}^{\infty} c(x(k), u(k)), \qquad x(0) = x_0$$

Part of this accounting includes these assumptions: $c(x^e, u^e) = 0$ and $\mathsf{U}(x^e) = \{u^e\}$.
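The accounting identity is easily checked numerically. The following Python sketch uses a hypothetical four-state model with equilibrium at the origin, and a hypothetical policy that moves one step toward the origin at each stage. The occupancy counts are accumulated along the trajectory and compared with a direct summation of cost.

```python
def occupancy(x0, policy, F, x_eq, max_steps=10_000):
    """Visit counts of (x, u) along the trajectory from x0, stopping at x_eq."""
    counts = {}
    x = x0
    for _ in range(max_steps):
        if x == x_eq:             # no count is recorded at the equilibrium
            return counts
        u = policy(x)
        counts[(x, u)] = counts.get((x, u), 0) + 1
        x = F(x, u)
    raise RuntimeError("equilibrium not reached within max_steps")


def total_cost(x0, policy, F, cost, x_eq, max_steps=10_000):
    """Direct summation of the cost along the same trajectory."""
    total, x = 0, x0
    for _ in range(max_steps):
        if x == x_eq:             # zero cost from here on
            return total
        u = policy(x)
        total += cost(x, u)
        x = F(x, u)
    raise RuntimeError("equilibrium not reached within max_steps")


# Hypothetical model: states 0..3 with equilibrium 0.
F = lambda x, u: max(x - u, 0)
cost = lambda x, u: x * x + u
policy = lambda x: 1

w = occupancy(3, policy, F, x_eq=0)
inner_product = sum(cost(x, u) * n for (x, u), n in w.items())
J = total_cost(3, policy, F, cost, x_eq=0)
```

Stopping the count on reaching the equilibrium encodes the convention that the occupancy is zero there, which is harmless because the cost vanishes at the equilibrium.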
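We then define for a given pmf $\nu$ on $\mathsf{X}$,

$$\varpi(x, u) = \sum_{x_0} \nu(x_0) \, \varpi(x, u \mid x_0)$$

This has a very different interpretation than the steady state probability measure introduced in
(5.22). The function $\varpi \colon \mathsf{X} \times \mathsf{U} \to \mathbb{R}_+$ is an example of a feasible variable for the LP to be
constructed, with objective function

$$\langle \varpi, c \rangle = \sum_{x,u} c(x,u) \varpi(x,u) = \langle \nu, J \rangle$$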


The constraints in the LP require further effort. For any occupancy pmf and each $x$ denote

$$\varpi_X(x) = \sum_u \varpi(x, u)$$

and whenever $\varpi_X(x) > 0$ we denote

$$\check{\phi}(u \mid x) = \frac{1}{\varpi_X(x)} \varpi(x, u) \tag{5.81}$$
In stochastic control this is viewed as a randomized policy: $\check{\phi}(u \mid x)$ is the probability that $u(k) = u$
when $x(k) = x$. More notation from stochastic control is a mapping “P”: for any occupation
measure $\varpi$, let $\varpi^+_X = \varpi P$ be defined by

$$\varpi^+_X(x) = \sum_{x^-, u^-} \varpi(x^-, u^-) \mathbb{1}\{F(x^-, u^-) = x\}$$
The representation $\varpi_X = \nu + \varpi P$ provides the constraint (5.82b) in the following LP:


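$$\min_\varpi \ \langle \varpi, c \rangle \tag{5.82a}$$

$$\text{s.t.} \quad \sum_u \varpi(x, u) = \nu(x) + \sum_{x^-, u^-} \varpi(x^-, u^-) \mathbb{1}\{F(x^-, u^-) = x\}, \qquad x \ne x^e \tag{5.82b}$$

$$\sum_u \varpi(x^e, u) = 0 \tag{5.82c}$$

$$\varpi(x, u) \ge 0, \qquad (x,u) \in \mathsf{X} \times \mathsf{U} \tag{5.82d}$$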

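Proposition 5.11. If $J^\star$ is finite valued, then the following hold:
(i) If $\varpi$ satisfies eqs. (5.82b) to (5.82d) then $\langle \varpi, c \rangle \ge \langle \nu, J^\star \rangle$.
(ii) The LP (5.82) has a solution $\varpi^*$ satisfying $\langle \varpi^*, c \rangle = \langle \nu, J^\star \rangle$.
(iii) If $\phi^\star$ is an optimal policy, then one optimal solution admits the decomposition

$$\varpi^*(x, u) = \mathbb{1}\{\phi^\star(x) = u\} \, \varpi^*_X(x) \tag{5.83}$$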


Proof. The proof of the proposition requires material from Part II. In particular, it requires famil- 
iarity with probability concepts beyond what was required up to here. If you have background in 
probability theory, then read on. 

Parts (ii) and (iii) are by construction: choose $\varpi^*$ to be the occupancy pmf associated with
$\phi^\star$. We move on to (i): For any feasible $\varpi$ we design an input sequence defined as a randomized
function of the state using (5.81):

$$\mathsf{P}\{u(k) = u \mid x(0), \dots, x(k)\} = \check{\phi}(u \mid x(k))$$

We have $\varpi_X(x^e) = 0$ by assumption, which is not a problem since we already impose $u(k) = u^e$
whenever $x(k) = x^e$. The policy is not defined for other states $x$ satisfying $\varpi_X(x) = 0$, which is
not a problem for a different reason: such states are never visited when this policy is used and the
initial condition satisfies $\nu(x_0) > 0$.

We then define, for any initial condition satisfying $\nu(x_0) > 0$,

$$J^{\check{\phi}}(x_0) = \mathsf{E}\Bigl[ \sum_{k=0}^{\infty} c(x(k), u(k)) \Bigr], \qquad x(0) = x_0$$

The proof of (i) is completed on establishing that $\langle \varpi, c \rangle = \langle \nu, J^{\check{\phi}} \rangle$, and that $J^{\check{\phi}}(x_0) \ge J^\star(x_0)$ for
any $x_0$. □
Every linear program has a dual. Here it is simplest to construct the dual following the steps
in Section 4.4.4. The dual variable is a function $\lambda \colon \mathsf{X} \to \mathbb{R}$ associated with the equality constraint
(5.82b) and the dual function is defined as follows:

$$\varphi^*(\lambda) = \min_\varpi \ \langle \varpi, c \rangle + \sum_x \lambda(x) \Bigl\{ -\sum_u \varpi(x, u) + \nu(x) + \sum_{x^-, u^-} \varpi(x^-, u^-) \mathbb{1}\{F(x^-, u^-) = x\} \Bigr\} \tag{5.84a}$$

$$\text{s.t.} \quad \varpi(x, u) \ge 0, \quad (x,u) \in \mathsf{X} \times \mathsf{U}, \qquad \varpi_X(x^e) = 0 \tag{5.84b}$$

We impose the constraint $\lambda(x^e) = 0$ since (5.82b) is imposed only for $x \ne x^e$.
The dual LP is by definition the maximum of $\varphi^*$ over all $\lambda$. The main conclusion is that this
dual is precisely the DPLP:


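Proposition 5.12. The dual function admits the representation

$$\varphi^*(\lambda) = \begin{cases} \langle \nu, \lambda \rangle & \text{if } c(x,u) - \lambda(x) + \lambda(F(x,u)) \ge 0 \text{ for all } x, u,\ x \ne x^e \\ -\infty & \text{else} \end{cases} \tag{5.85}$$

If $J^\star$ is finite valued, then $\varphi^*(\lambda) \le \langle \nu, J^\star \rangle$ for any $\lambda$, and this bound is achieved using $\lambda^* = J^\star$.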
Proof. Using this identity:

$$\sum_x \lambda(x) \mathbb{1}\{F(x^-, u^-) = x\} = \lambda(F(x^-, u^-))$$
it follows that the objective can be expressed

$$\varphi^*(\lambda) = \min_\varpi \ \langle \varpi, c - \lambda + P\lambda \rangle + \langle \nu, \lambda \rangle$$
subject to the constraints (5.84b), and where the function $\lambda^+ = P\lambda$ is defined by

$$\lambda^+(x, u) = \lambda(F(x, u))$$
This implies (5.85). The bound $\varphi^*(\lambda) \le \langle \nu, J^\star \rangle$ holds because the constraint (5.82b) has been
relaxed in (5.84) (this bound is known as weak duality). The fact that this upper bound is achieved
using $\lambda^* = J^\star$ is immediate from the dynamic programming equation that $J^\star$ satisfies. □
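The conclusions of Props. 5.11 and 5.12 can be verified by hand in small examples. The Python sketch below uses a hypothetical five-state model: the optimal value function is obtained by value iteration, the occupancy pmf of a greedy policy is accumulated with the initial condition drawn from a uniform pmf on the non-equilibrium states, and the primal and dual objectives are then compared. All model data and names are assumptions made for illustration.

```python
STATES = [0, 1, 2, 3, 4]
INPUTS = [1, 2]
X_EQ = 0                                             # equilibrium state
F = lambda x, u: max(x - u, 0)                       # hypothetical dynamics
cost = lambda x, u: 0 if x == X_EQ else x + u * u    # zero cost at the equilibrium
nu = {x: 0.25 for x in STATES if x != X_EQ}          # pmf on initial conditions


def value_iteration(n_iter=20):
    """Total-cost value iteration, with the value at the equilibrium fixed at zero."""
    J = {x: 0 for x in STATES}
    for _ in range(n_iter):
        J = {x: 0 if x == X_EQ else min(cost(x, u) + J[F(x, u)] for u in INPUTS)
             for x in STATES}
    return J


J_star = value_iteration()
greedy = {x: min(INPUTS, key=lambda u: cost(x, u) + J_star[F(x, u)])
          for x in STATES if x != X_EQ}

# Occupancy of the greedy policy, averaged over initial conditions drawn from nu.
w = {}
for x0, weight in nu.items():
    x = x0
    while x != X_EQ:
        u = greedy[x]
        w[(x, u)] = w.get((x, u), 0.0) + weight
        x = F(x, u)

primal = sum(cost(x, u) * m for (x, u), m in w.items())    # objective (5.82a)
dual = sum(nu[x] * J_star[x] for x in nu)                  # dual objective at lambda = J_star

# Inequality appearing in (5.85), tested with lambda = J_star.
dual_feasible = all(cost(x, u) - J_star[x] + J_star[F(x, u)] >= 0
                    for x in STATES if x != X_EQ for u in INPUTS)


def balance_residual(x):
    """Left side minus right side of the constraint (5.82b) at state x."""
    lhs = sum(m for (y, u), m in w.items() if y == x)
    inflow = sum(m for (y, u), m in w.items() if F(y, u) == x)
    return lhs - nu[x] - inflow


max_residual = max(abs(balance_residual(x)) for x in nu)
```

In this example the occupancy pmf is feasible for the primal, the optimal value function is feasible for the dual, and the two objective values coincide, so there is no duality gap.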


5.8 Exercises 


5.1 Over fitting. Consider the curve fitting problem without noise: $y_i = H^\star(z_i)$ in which $H^\star(z) = z^2$
for $z \in \mathbb{R}$, and evenly spaced inputs $\{z_i = i/n : 1 \le i \le 2n\}$. Calculate the estimate $h^*$ using
RKHS function approximation, based on the solution (5.18). Plot $h^*$ for various values of $\lambda$ (be
sure to include $\lambda = 0$). Also experiment with $n$ and choice of kernel.


5.2 For the linear system (2.13a) with scalar-input, consider 


$$u(k) = -Kx(k) + \xi(k) \tag{5.86}$$
The gain is chosen so that the eigenvalues of $F - GK$ lie within the open unit disk in $\mathbb{C}$, and the
probing signal is a mixture of sinusoids:

$$\xi(k) = \sum_{i=1}^{p} a_i \sin(2\pi[\phi_i + \omega_i k])$$


The question to be considered: how large must we take p to ensure sufficient exploration? To 
estimate d parameters you might answer p = d, but that intuition may fail because of nonlinearities 
in our “observations”. 


Consider a special case in which the input and state are scalar valued, and denote $\psi_{(k)} = (x(k)^2, u(k)^2, x(k)u(k))^T$.
This regression vector might be used in TD-learning for a scalar linear system. Denote

$$R^\psi = \lim_{N \to \infty} \frac{1}{N} \sum_{k=0}^{N-1} \psi_{(k)} \psi_{(k)}^T$$


Investigate the consequences of rank deficiency, and see if you can construct an example for which 
this matrix is full rank using p = 1 or 2. 


It is essential to approximate:

$$x(k) = \sum_{i=1}^{p} a^x_i \sin(2\pi[\phi^x_i + \omega^x_i k]) + \varepsilon(k)$$

where $\varepsilon(k)$ converges to zero geometrically quickly, and $\{a^x_i, \phi^x_i, \omega^x_i\}$ are obtained by reviewing your
lecture notes on Signals & Systems! You can ignore the vanishing term $\varepsilon(k)$ in your computation
of $R^\psi$.


5.3 Obtain an expression for A(#) = —V2Iy (0) for the batch loss function (5.36b). Obtain 
conditions on {x*(0), u’(0)} so that A is Hurwitz. For this it will be helpful to represent —A as the 
sum of M positive semidefinite matrices. 


5.4 The ODE approximation for TD($\lambda$) is linear, $\frac{d}{dt} \vartheta = A(\vartheta - \theta^*)$, provided the function
approximation architecture is linear (recall (5.48)). With $Q^\theta = \theta^T \psi$, we have

$$A = \mathsf{E}_\varpi \bigl[ \zeta_k [-\psi(x(k), u(k)) + \psi(x(k+1), \phi(x(k+1)))]^T \bigr]$$

The existence of a steady-state $\varpi$ can be justified by considering an implementation with restart.


In this exercise you will consider $\lambda = 0$ so that $\zeta_k = \psi_{(k)} = \psi(x(k), u(k))$, and consider the on-policy
setting for which we may write

$$A = \mathsf{E}_\varpi [ -\psi_{(k)} \psi_{(k)}^T + \psi_{(k)} \psi_{(k+1)}^T ]$$

To simplify the calculations that follow, assume that $\mathsf{E}_\varpi[\psi_{(k)} \psi_{(k)}^T] = I$, and also assume the
non-degeneracy condition: $\mathsf{E}_\varpi[\{\theta^T(\psi_{(k+1)} - \psi_{(k)})\}^2] > 0$ whenever $\theta \ne 0$.
Show that the matrix $A$ is Hurwitz under these assumptions.


Suggested approach: let $(\lambda, v)$ denote an eigenvalue-eigenvector pair, with $\|v\| = 1$ (and remember
that $\lambda$ and $v$ may be complex). You then have, by the eigenvector property,

$$v^\dagger A v = \lambda v^\dagger v = \lambda$$
where “†” denotes complex conjugate transpose. Consider next the result of our simplification:

$$A = -I + \mathsf{E}_\varpi[\psi_{(k)} \psi_{(k+1)}^T]$$

$$\lambda = v^\dagger A v = -1 + \mathsf{E}_\varpi[(v^\dagger \psi_{(k)})(v^T \psi_{(k+1)})]$$

The right hand side is $-1 + \langle v^T \psi_{(k)}, v^T \psi_{(k+1)} \rangle_\varpi$ in the notation introduced in Section 5.4.1. Look
up the Cauchy-Schwarz inequality to see how to bound the inner product, and conclude that the
real part of $\lambda$ is strictly negative.


5.5 Formulate TD($\lambda$)-learning for the discounted-cost criterion, with fixed-policy Q-function
defined by

$$Q^\phi(x, u) = \sum_{k=0}^{\infty} \gamma^k c(x(k), u(k)), \qquad x(0) = x,\ u(0) = u \text{ and } u(k) = \phi(x(k)) \text{ for } k \ge 1.$$


The key step is to obtain a new definition for the temporal difference. 


Repeat Exercise 5.4 for the on-policy version of your algorithm. 


5.6 TD-learning for the inverted pendulum. Review Exercise 3.11, and see if you can find inspiration
to obtain a basis for TD- or Q-learning with cost function $c(x,u) = \theta^2 + u^2$.


Apply TD-learning with a linear function class to estimate the total cost value function for your
favorite policy $\phi$ from Exercise 3.11. Does the policy improvement step based on your estimate of
$Q^\phi$ provide a sensible policy?


5.7 TD-learning for MagBall. Take your favorite policy for MagBall, such as the nonlinear policy 
(2.74) (for suitable values of K and K3). Apply TD-learning with a linear function class to estimate 
Q? with c(x,u) = 9? + u? (see Exercise 4.14). 

Use your intuition to devise a basis. For example, you expect the value function to grow very 
quickly when the position is near the magnet, or very far away. 


Does the policy improvement step based on your estimate of $Q^\phi$ provide a sensible policy?


5.8 (Return to rowing). This and many other exercises in this chapter might be assigned over 
several weeks, and split into two parts: 

Part 1: Prepare a proposal on how you would tackle the rowing problem using either approximate 
policy iteration or a version of Q-learning. Give rationale for your choice of algorithm. Give full 
details regarding exploration, step size, etc (and explain your choices). Predict what might go 
wrong, and how you would modify your design. 


Don’t forget that this is a (cooperative) game. For simplicity, you might convert this into a “two
player game” as follows: isolate one rower as “player 1”, and then the collection of 100 rowers that
model “player 2”. In a best response strategy, player 1 learns the best policy $\phi^n$ in response to the
remaining 100 rowers playing policy $\phi^{n-1}$.

Part 2: Experiment with your algorithm with a finite population of rowers. Make sure you choose
diverse initial conditions for the population in each round.


5.9 Convex Q with binning. If the input space is not too large, and binning of the state space is
feasible, then there is a more intuitive relaxation of the DPLP:

$$\theta^* = \arg\max_\theta \ \{ \langle \mu, Q^\theta \rangle \ \ \text{s.t.} \ \ D^{i,u}(\theta) \ge 0 \ \text{ for each } 1 \le i \le d \text{ and } u \in \mathsf{U} \}$$

with

$$D^{i,u}(\theta) = \frac{1}{N} \sum_{k=0}^{N-1} [ -Q^\theta(x(k), u(k)) + c(x(k), u(k)) + Q^\theta(x(k+1), u) ] \, \mathbb{1}\{x(k) \in B_i\}$$

Observe that $D^{i,u}(\theta)$ is an affine function of $\theta$ when $\{Q^\theta\}$ is defined using a linear function
approximation (not necessarily using binning).


Try this out with the Mountain Car example, taking note of the warnings in Example 5.5.1.
5.10 ERM and kernel methods. This exercise is intended to improve understanding of both kernel
methods and batch algorithms. The integer $r_n \ge 1$ denotes the batch size, using the same
interpretation as in DQN (see (5.58)), and the batch of data at stage $n$ is denoted $\{z^n_i : 1 \le i \le r_n\}$.


Consider the recursive algorithm

$$h_{n+1} = \arg\min_{h \in \mathcal{H}} \Bigl\{ \Gamma_n(h) + \frac{1}{2} \frac{1}{\alpha_{n+1}} \|h - h_n\|^2_n + \frac{\delta}{2} \|h\|^2_{\mathcal{H}} \Bigr\}$$

$$\|h - h_n\|^2_n = \frac{1}{r_n} \sum_{i=1}^{r_n} \bigl( h(z^n_i) - h_n(z^n_i) \bigr)^2$$

with $\{\alpha_n\}$ a sequence of positive scalars, and the objective has the form

$$\Gamma_n(h) = L_n(h(z^n_1), \dots, h(z^n_{r_n}))$$

The Representer Theorem tells us that $h_{n+1}(z) = \sum_{i=1}^{r_n} \theta^{n+1}_i \mathsf{k}(z, z^n_i)$ for an $r_n$-dimensional vector
$\theta^{n+1}$.
Obtain a recursive formula for $\{\theta^n : n \ge 1\}$ for the curve fitting problem, in which $\Gamma_n$ is the
quadratic function

$$\Gamma_n(h) = \sum_{i=1}^{r_n} \bigl( y^n_i - h(z^n_i) \bigr)^2$$

with $\{y^n_i\}$ scalars. Your solution will be similar to (5.18).


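5.11 The loss function eq. (5.58a) in DQN simplifies computation, but unfortunately does not
address the many challenges with Q($\lambda$)-learning. Consider the alternative:

$$\Gamma^\varepsilon_n(\theta) = \frac{1}{r_n} \sum_{k=T_n}^{T_{n+1}-1} [ -Q^\theta(x(k), u(k)) + c(x(k), u(k)) + \widehat{\underline{Q}}^\theta(x(k+1)) ]^2 \tag{5.87}$$

in which $\widehat{\underline{Q}}^\theta(x) = \underline{Q}^{\theta_n}(x) + \nabla_\theta \underline{Q}^{\theta_n}(x) \cdot [\theta - \theta_n]$. For a linear function approximation, a sub-gradient
is given by

$$\nabla_\theta \underline{Q}^\theta(x) = \psi(x, \phi^\theta(x)), \qquad \phi^\theta(x) \in \arg\min_u Q^\theta(x, u)$$

This is a true gradient whenever the minimizer of $Q^\theta(x, u)$ over $u$ is unique.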
(a) Obtain a version of Prop. 5.9 for this algorithm (a fixed point equation for $\theta_{n+1}$). Based on
this, propose a recursive algorithm similar to batch Q(0)-learning. 

(b) Obtain a conjecture on how Prop. 5.10 should be modified using the new definition (5.87), 
and obtain an interpretation for a limit point of the algorithm. Feel free to assume a linear 
parametrization. 


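(c) Review Exercise 5.10, and consider the loss function defined on a RKHS: for $h \in \mathcal{H}$,

$$\Gamma^\varepsilon_n(h) = \frac{1}{r_n} \sum_{k=T_n}^{T_{n+1}-1} [ -h(x(k), u(k)) + c(x(k), u(k)) + \widehat{\underline{h}}(x(k+1)) ]^2$$

and the associated batch Q-learning algorithm

$$h_{n+1} = \arg\min_{h \in \mathcal{H}} \Bigl\{ \Gamma^\varepsilon_n(h) + \frac{1}{2} \frac{1}{\alpha_{n+1}} \|h - h_n\|^2_n + \frac{\delta}{2} \|h\|^2_{\mathcal{H}} \Bigr\}$$

Obtain a recursive representation for the optimizers $\{h_n : n \ge 1\}$ (similar to Exercise 5.10).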


5.9 Notes 


This chapter also covers a great deal of ground. These notes provide a few missing ingredients, and 
offer sources for further reading. 


Machine learning The vast literature on function approximation, model selection and over- 
fitting is an indication of its importance. 

There are many good references on machine learning. The Elements of statistical learning 
[156] and Murphy’s Machine learning: a probabilistic perspective [267] are great introductions,
and MacKay’s survey [234] is fast-paced but also good. References [262, 290] contain accessible 
treatments of kernels with statistical interpretations (and much more). See also the classics [81, 
97, 365] for ways of quantifying model complexity and over-fitting from various perspectives. 


TD-learning A full history of temporal difference methods can be found in the second edition 
of the monograph of Sutton and Barto [338]. While these authors are considered major stars in 
RL and in particular TD methods, they are the first to point out an older history going back to 
Shannon in the 1950s. The TD(0) algorithm was introduced by Witten in the 1970s [377]. See 
chapter 1 of [338] for a scholarly and entertaining survey of the origins of the field. 

The promise of TD-learning took a leap following Sutton’s dissertation [339, 340] (based in part 
on collaboration with Charles Anderson, and his advisor Andrew Barto [342, 26]). This early work 
contains substantial intuition regarding TD(A) algorithms and their application, with emphasis on 
neural-network approximation architecture for applications, and linear function approximation for 
much of the theory. The end of the decade was crowned by Watkins’ dissertation that introduced 
the Q-learning algorithm [372, 371, 346]. 

It is surprising that there was so little synergy in the 1980s and ’90s between RL and adaptive 
control [145, 84, 202, 242]. There was significant outreach in one direction: notable examples include 
the surveys by Sutton et al.: Learning and Sequential Decision Making [27], and Reinforcement
learning is direct adaptive optimal control [343]. RL pioneers in the control systems research 
community include Frank Lewis [183, 222] who remains a leader at the intersection of RL and 
control systems, and John Tsitsiklis whose contributions are surveyed in Sections 9.11 and 10.10.
An early example of off-policy TD learning is [75], which treats the LQR problem and is written 
for a control theory audience. 

The TD-learning algorithm (5.34) is more similar to the (off-policy) LSQ algorithm of [206] 
than the original LSTD algorithm of [76]. The common acronym “LSTD” is used for both on- and
off-policy implementations throughout the book. 

The objective function $\Gamma$ in (5.33) is known as the mean-square Bellman residual in Baird’s 1995
paper [21]. It is pointed out there that the resulting SA approximation of the gradient flow is slow 
to converge. The paper goes on to investigate compromises between TD(0) and gradient descent to 
accelerate convergence. This paper also introduced the famous counterexample based on Fig. 5.5. 
Around the same time, Gordon [146] presented examples of unstable function approximations for 
Q-learning. 

The theoretical foundations of temporal difference methods took a significant leap one decade 
after Sutton’s dissertation, following the realization that TD-learning can be cast as a stochastic 
approximation algorithm [352, 169]. This sparked Tsitsiklis’ research program on reinforcement 
learning that has had enormous impact on the field, and on my own appreciation of the discipline. 
Two dissertations supervised by Tsitsiklis were truly ground-breaking in this domain: Ben Van 
Roy [363] and Vijaymohan Konda [188] provide elegant theory that explains the success (and 
potential failure) of TD-learning and actor-critic algorithms for stochastic control systems. Much 
of this theory is a foundation for material in Part II of this book. 

Kernel methods in RL have a significant history, with most of the algorithms designed to 
approximate the value function for a fixed policy, so that the function approximation problem can 
be cast in a least-squares setting [279, 131] (the latter proposes extensions to Q-learning). 

Left out of the book is any discussion on “safe RL” in which an attempt is made to enforce 
stability guarantees. The theory developed in this book should help you understand the literature, 
such as [284, 94], and help you to discover new techniques. 


Q learning First, why Q? The question of its origin was raised by Aaron Snoswell (then a graduate
student at Queensland University of Technology) during the early days of the Fall 2020 Simons Institute
program on reinforcement learning. Csaba Szepesvari contacted Chris Watkins, who responded 
shortly after: “... which letter to choose? I realised I hadn’t used Q, that enigmatic letter, and one 
could retro-fit ‘quality’ ...”. An enigmatic letter is fitting for this mysterious class of algorithms! 

The Q($\lambda$) algorithm (5.52) is not commonly seen in textbooks, except for the special case
$\lambda = 0$. Q(0)-learning and variants are commonly applied because they are the natural extension of
Watkins’ Q-learning algorithm for controlled Markov chains. There is a firm theory for convergence 
of Q(0)-learning in the special case considered by Watkins, in part because it is assumed that the 
function class is “complete” (contains every possible function on X x U). Convergence of Watkins’ 
algorithm is established using ODE methods in Section 9.6. 

Major success stories in Deep Q-Learning (DQN) practice are described in [261, 260, 259], 
and [9] contains an insightful analysis. The limit theory for the algorithm (5.58) summarized in 
Prop. 5.10 is taken from [230]. The refinement of DQN introduced in Exercise 5.11 is inspired by 
the convex-concave optimization procedure of Hartman 1959 [155, 226]. Another approach to batch 
RL is based on empirical value iteration [153, 317, 174]. 

The origin of convex Q-learning is [246], where a family of LP approaches is formulated in
continuous time. The challenge at the time was to find ways to create a reliable online algorithm.
The algorithmic challenges were addressed in [230, 247], which is the basis of much of Section 5.5.
For more on RL in continuous time consider reviewing the research program of Lewis [184, 359],
the volume [360], and the recent survey [220]. 

Recall from Section 3.11 that Manne introduced the linear programming approach to dynamic 
programming in the 1960 article [238]. A significant program on linear programming approaches 
to approximate dynamic programming for Markovian models (MDPs) began in [314, 102, 103, 104] 
and continues today. 

The algorithms and theory in Section 5.5 are based on [230], which builds on the first version 
of Convex Q-learning [246] (designed for deterministic systems in continuous time). Improvements 
appeared in the following decade, primarily in a stochastic control setting. For example, the theory 
in [218] is based on a variant of the convex program (3.36), with the inequality in (3.36b) replaced 
by equality: it contains substantial theory for the tabular setting (recall (5.10)). More recent results 
in [28] obtain efficient algorithms based on an LP formulation closely related to the DPLP (3.36).

Chapter 6


Markov Chains 


We begin Part II with a return to the foundations of state space models, but now in a stochastic 
setting. The notation is essentially the same as in previous chapters, except that we use upper 
case to denote random variables: $\boldsymbol{X} = \{X(0), X(1), X(2), \dots\}$ denotes a sample path of the state
process. For each time $k \ge 0$, it is assumed that $X(k)$ takes values in a state space denoted $\mathsf{X}$. As
in previous chapters, the state space is a closed subset of $\mathbb{R}^n$ (recall (2.22)). We don’t rule out a
finite state space, but there is no reason to be bound to this special case. 


6.1 Markov Models are State Space Models 


Recall from Section 2.3 that the state process in the nonlinear state space model is interpreted 
as a sufficient statistic. Consider the controlled model (2.6a), under a stationary Markov policy 
$u(k) = \phi(x(k))$ for $k \ge 0$. To compute future states $x(j)$ for $j > k$ we only need to know $x(k)$; the
prior history $\{x(i), u(i) : i < k\}$ is irrelevant in determining future behavior.

The definition of a Markov chain is formulated to capture the same memoryless property in a 
stochastic environment. An example is an i.i.d. (independent and identically distributed) sequence 
$\boldsymbol{N}$, in which $N(0) \sim \pi$ for a probability measure $\pi$, and there is no memory:

$$ \mathsf{P}\{N(k) \in S \mid N(0), \dots, N(k-1)\} = \pi(S), \qquad \text{for each } k \text{ and } S \subset \mathsf{X} \tag{6.1} $$


These form a building block for more complex stochastic processes. 
Here are two definitions of a Markov chain. The first makes precise the definition of “memory- 
less” in a form milder than (6.1). 


(i) Memoryless property: The stochastic process $\boldsymbol{X}$ is a Markov chain if the following holds:
For $S \subset \mathsf{X}$, any time $k$, and initial $X(0)$,

$$ \mathsf{P}\{X(k+1) \in S \mid X(0), \dots, X(k)\} = \mathsf{P}\{X(k+1) \in S \mid X(k)\} \tag{6.2} $$


(ii) Nonlinear state space model: A Markov chain is a stochastic process that evolves according 
to the nonlinear state space model, 


$$ X(k+1) = \mathrm{F}(X(k), N(k+1)) \tag{6.3} $$

where $\boldsymbol{N}$ is i.i.d., and the initial condition $X(0)$ is specified (if it is random, then it is assumed
independent of the “disturbance” $\boldsymbol{N}$).


The first definition is slightly more general than the second, only because the dynamics defined by
F do not depend upon time. We will restrict to such time-homogeneous Markov chains throughout 
the book. If an application demands a time-varying model, then we may resort to the state- 
augmentation tricks described in Section 3.3. 
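
The nonlinear state space model (6.3) translates directly into a simulation loop. The following is a minimal sketch rather than anything prescribed by the text: the function name `simulate_chain`, the choice F(x, n) = 0.5x + n, and the uniform disturbance are all assumptions made purely for illustration.

```python
import random


def simulate_chain(F, x0, draw_noise, steps, seed=0):
    """Return the sample path [X(0), ..., X(steps)] of X(k+1) = F(X(k), N(k+1))."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(steps):
        # The next state depends only on the current state and fresh i.i.d. noise,
        # which is the memoryless property (6.2).
        path.append(F(path[-1], draw_noise(rng)))
    return path


# Hypothetical example: scalar dynamics F(x, n) = 0.5 x + n,
# with disturbance uniform on [-1, 1] (an illustrative assumption).
path = simulate_chain(
    F=lambda x, n: 0.5 * x + n,
    x0=0.0,
    draw_noise=lambda rng: rng.uniform(-1.0, 1.0),
    steps=1000,
)
```

Passing the disturbance sampler as an argument keeps the loop agnostic to the marginal distribution of N, in the same spirit as definition (ii).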

The distribution of $X(k)$ for $k \ge 0$ is defined by the initial distribution (the distribution of
the potentially random $X(0)$), and the transition kernel. This defines the one-step transition
probabilities,

$$ P(x, S) = \mathsf{P}\{X(k+1) \in S \mid X(k) = x\}, \qquad x \in \mathsf{X},\ S \subset \mathsf{X}. \tag{6.4} $$

It is called the transition matrix if the state space $\mathsf{X}$ is finite. For the nonlinear state space model
(6.3),

$$ P(x, S) = \mathsf{P}\{X(1) \in S \mid X(0) = x\} = \mathsf{P}\{\mathrm{F}(x, N(1)) \in S\} $$

For $j \ge 1$, the $j$-step transition probability from $x$ to $S$ is denoted

$$ P^j(x, S) = \mathsf{P}\{X(k+j) \in S \mid X(k) = x\} \tag{6.5} $$


In the majority of cases, we no longer think about equilibria $x^e \in \mathsf{X}$ when studying Markov
chains. We seek instead an equilibrium measure $\pi$ (more commonly called an invariant measure)
satisfying the ergodic theorem:

$$ \lim_{k \to \infty} P^k(x, S) = \pi(S), \qquad \text{for any } x \in \mathsf{X} \text{ and } S \subset \mathsf{X} \tag{6.6} $$

We will see that this implies the existence of a steady-state: If $X(0) \sim \pi$, then $\boldsymbol{X}$ is a stationary
process (so that, in particular, $X(k) \sim \pi$ for all $k$). This is interpreted as a stability property
for the Markov chain, much like global asymptotic stability for deterministic nonlinear state space 
models. 


Notation and Conventions 


If you have taken a course with an introduction to measure theory then you know that care must 
be taken in defining the class of sets S for which (6.4) and the equations that follow are meaningful. 
For those of you with this background, keep in mind that $\mathsf{X}$ will be taken to be a Borel subset
of Euclidean space (such as a closed subset), and any $S \subset \mathsf{X}$ will also be assumed to have this
property. This is denoted $S \in \mathcal{B}(\mathsf{X})$. Examples of Borel sets in $\mathbb{R}$ include any interval $[a, b]$, $[a, b)$,
$(a, b]$, or $(a, b)$ with $b > a$, and any finite or countable union of intervals. Similarly, any function
$h \colon \mathsf{X} \to \mathbb{R}$ is assumed to be Borel measurable. That is, the set $S_h(r) = \{x : h(x) \le r\}$ is a Borel
set for each constant $r$.

Never heard the word “Borel”? Don’t worry. Measure theory is not important for understanding 
any of the concepts that follow. The notation is required because transition kernels cannot in general 
be defined on every subset of the state space, except when the state space is finite or countably 
infinite. 


Integrals and expectations The symbols $\mu$, $\nu$ and $\pi$ are reserved for probability measures on
$\mathcal{B}(\mathsf{X})$. We opt for the term probability mass function (pmf) when $\mathsf{X}$ is finite or countable.

The state $X(k)$ at time $k$ is a random variable. Its distribution is a probability measure denoted
$\mu_k$, defined for any Borel measurable function $h$ via

$$ \mathsf{E}[h(X(k))] = \int h(x)\, \mu_k(dx) \tag{6.7} $$

The notation may look strange to you, but I know of no better alternative when we don’t know
the form of $\mu_k$. If there is a density, then by definition $\mu_k(dx) = p_k(x)\, dx$ for some non-negative
function $p_k$; if $\mu_k$ is discrete then the integral becomes a sum. The Gaussian linear state space
model discussed in Example 6.2.1 below makes clear why we need this abstract notation.

Please keep in mind: we don’t know the form of $\mu_k$. We don’t want to write one book for
Markov chains with densities, another for those on a countable state space, and a third for when
neither assumption applies (as found in Example 6.2.1). The notation (6.7) is agnostic to the form
of $\mu_k$.


Notation surrounding conditional expectations Conditional expectations of the form

$$ \mathsf{E}[h(X(k+1)) \mid X(0), \dots, X(k)] $$

appear everywhere in the remainder of the chapter and the book (which is meaningful provided
$\mathsf{E}[|h(X(k+1))|] < \infty$). Never heard of conditional expectation? Exercise 6.3 is provided as a mini
crash-course, with a longer crash-course in Section 9.2. You should review [154] or some other
foundational source for a fuller understanding.

Fortunately, definitions are simplified when considering functions of a Markov chain, because
the conditional expectation can be expressed in terms of the transition kernel: for any $x \in \mathsf{X}$, and
any initial distribution for $X(0)$,

$$ \begin{aligned} \mathsf{E}[h(X(k+1)) \mid X(0), \dots, X(k-1);\ X(k) = x] &= \mathsf{E}[h(X(k+1)) \mid X(k) = x] \\ &= \int P(x, dy)\, h(y) \end{aligned} $$


If $\mathsf{X}$ is finite, of size $m$, then $P$ is interpreted as an $m \times m$ matrix. In this case $P(x_0, x_1)$ is the
probability of moving from $x_0$ to $x_1$ in one time-step. The conditional expectation is expressed as
a sum,

$$ \mathsf{E}[h(X(k+1)) \mid X(k) = x] = \sum_{x_1 \in \mathsf{X}} P(x, x_1)\, h(x_1) \tag{6.8} $$


Conditional expectations appear so frequently that we require shorthand notation: For a function
$h \colon \mathsf{X} \to \mathbb{R}$ and integers $r, k \ge 0$,

$$ \mathsf{E}_x[h(X(k))] \mathrel{:=} \mathsf{E}[h(X(r+k)) \mid X(r) = x] \tag{6.9a} $$

$$ P^k h\,(x) \mathrel{:=} \mathsf{E}[h(X(r+k)) \mid X(r) = x], \qquad x \in \mathsf{X}. \tag{6.9b} $$

In the special case $k = 1$ we write $Ph$ rather than $P^1 h$.

In (6.9b) we view $P^k$ as a mapping from functions to functions. This representation was
introduced for finite state-space models early on in this book, just above eq. (3.17), where for
emphasis we introduced the vector notation


$$ \vec{h} = [h(x^1), \dots, h(x^m)]^\top, \qquad \mathsf{X} = \{x^1, \dots, x^m\} $$


and wrote $P^k \vec{h}$ rather than $P^k h$. This cumbersome vector notation will not appear again in this
book.

If $\mu$ is a probability measure on $\mathsf{X}$, and $X(r) \sim \mu$ (the state is distributed according to $\mu$ at
time $r$), it then follows that $X(k+r) \sim \mu_k$ for any $k \ge 1$, with

$$ \mu_k(S) = \int \mu(dx)\, P^k(x, S), \qquad S \in \mathcal{B}(\mathsf{X}). \tag{6.10} $$

This is expressed $\mu_k = \mu P^k$, which is also consistent with vector-matrix multiplication in the finite
state space setting. This probability measure is precisely the same as used in (6.7) when $r = 0$.
We could go further, writing

$$ \mu h = \int h(x)\, \mu(dx) $$

This is consistent with multiplication of a column vector $h$ by a row vector $\mu$ to obtain a scalar (an
inner product). However, “$\mu h$” can create ambiguity, so we introduce parentheses for emphasis:

$$ \mu(h) = \int h(x)\, \mu(dx) \tag{6.11} $$

Please get used to these notational conventions! They will be used throughout the remainder of
the book.
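
For a finite state space the conventions (6.5) and (6.9b) to (6.11) reduce to ordinary matrix algebra, which may help in getting used to them. The sketch below assumes NumPy is available; the two-state transition matrix, the function h and the pmf mu are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical two-state transition matrix; each row sums to one.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
h = np.array([1.0, -2.0])   # a function h on X, stored as a column vector
mu = np.array([0.5, 0.5])   # a pmf mu on X, stored as a row vector

k = 3
Pk = np.linalg.matrix_power(P, k)  # k-step transition matrix P^k, as in (6.5)
Pk_h = Pk @ h                      # (6.9b): P^k h (x) = E[h(X(r+k)) | X(r) = x]
mu_k = mu @ Pk                     # (6.10): pmf of X(r+k) when X(r) ~ mu
mu_h = mu @ h                      # (6.11): mu(h) = sum_x mu(x) h(x)
```

Note that `mu_k @ h` and `mu @ Pk_h` agree: both equal the expectation of h(X(r+k)) when X(r) is distributed according to mu.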


6.2 Simple Examples 


We begin with a few basic examples. The linear model is standard in physics, systems theory, 
economics, and many other areas. 


Example 6.2.1. The Linear State Space Model 


Suppose $\boldsymbol{X} = \{X(k)\}$ is a stochastic process for which there is an $n \times n$ matrix $F$ and an i.i.d.
sequence $\boldsymbol{N}$ taking values in $\mathbb{R}^n$ such that

$$ X(k+1) = F X(k) + N(k+1), \qquad k \ge 0 $$

where $X(0) \in \mathbb{R}^n$ is independent of $\boldsymbol{N}$. Then $\boldsymbol{X}$ is called the (uncontrolled) linear state space
model. It is precisely of the form (6.3) in which $\mathrm{F}(x, n) = Fx + n$.
We denote the state process using a lower case variable when $\boldsymbol{N} = 0$, as in eq. (2.42):

$$ x(k+1) = F x(k), \qquad k \ge 0. \tag{6.12} $$

The state process $\boldsymbol{x}$ is also Markovian: given the full history of observations $\{x(i) : i \le k\}$, we can
still predict $x(k+1)$ with exact accuracy, based only on knowledge of $x(k)$.
The transition kernel is easily described: For any $x \in \mathsf{X} = \mathbb{R}^n$ and set $S \subset \mathsf{X}$ we have,

$$ P(x, S) = \mathsf{P}\{X(1) \in S \mid X(0) = x\} = \mathsf{P}\{Fx + N(1) \in S\} $$

If in particular $N(1)$ is Gaussian $N(0, \Sigma_N)$, then $P(x, \cdot)$ is also a Gaussian distribution, but with
mean $Fx$ rather than zero. If at some time $r \ge 0$ we observe $X(r) = x$, then for each $j \ge 1$,

$$ X(r+j) = F^j x + \sum_{i=1}^{j} F^{j-i} N(r+i) $$

That is, conditioned on $X(r) = x$, the random vector $X(r+j)$ has a Gaussian distribution with
mean $F^j x$ and covariance

$$ \Sigma_{X_j} = \sum_{i=1}^{j} F^{j-i}\, \Sigma_N\, (F^{j-i})^\top $$

If $\Sigma_{X_j}$ is full rank (so its inverse exists), then $P^j$ admits a transition density:

$$ p_j(x, y) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma_{X_j})}} \exp\Bigl( -\tfrac{1}{2} (y - F^j x)^\top \Sigma_{X_j}^{-1} (y - F^j x) \Bigr), \qquad P^j(x, S) = \int_S p_j(x, y)\, dy $$

where the $j$-step transition kernel is defined in (6.5).
From this we obtain our first ergodic theorem: 


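Proposition 6.1. Suppose that the eigenvalues of $F$ lie in the open unit disk in $\mathbb{C}$, and that $\boldsymbol{N}$
is i.i.d., with Gaussian marginal $N(0, \Sigma_N)$. Consider any factorization, $\Sigma_N = GG^\top$, with $G$ an
$n \times m$ matrix for some $m \ge 1$, and suppose that the rank condition holds:

$$ \operatorname{rank}(\mathcal{C}) = n, \qquad \text{where } \mathcal{C} = [\,G \mid FG \mid \cdots \mid F^{n-1}G\,] \tag{6.13} $$

Then, the following conclusions hold:

(i) The steady-state covariance $\Sigma_{X_\infty}$ has rank $n$, where

$$ \Sigma_{X_\infty} = \lim_{k \to \infty} \Sigma_{X_k} = \sum_{i=0}^{\infty} F^i\, \Sigma_N\, (F^i)^\top $$

(ii) The density $p_k$ exists for $k \ge n$, and converges as $k \to \infty$: for any $x, y$,

$$ \lim_{k \to \infty} p_k(x, y) = p_\infty(y) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma_{X_\infty})}} \exp\bigl( -\tfrac{1}{2}\, y^\top \Sigma_{X_\infty}^{-1}\, y \bigr) $$

The limit defines the invariant measure: for $S \in \mathcal{B}(\mathsf{X})$,

$$ \pi(S) = \int_S p_\infty(y)\, dy = \int p_\infty(y)\, P^k(y, S)\, dy = \int P^k(y, S)\, \pi(dy) $$

(iii) For functions $h$ satisfying $\int |h(y)|\, p_\infty(y)\, dy < \infty$ we have, for each initial condition
$X(0) = x$,

$$ \lim_{k \to \infty} \mathsf{E}_x[h(X(k))] = \lim_{k \to \infty} P^k h\,(x) = \pi(h) = \int p_\infty(y)\, h(y)\, dy $$

□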


In state space control theory, the matrix $\mathcal{C}$ in (6.13) is called the controllability matrix, and
$\Sigma_{X_\infty}$ the controllability Grammian. Part (i) is a consequence of the Cayley-Hamilton Theorem
(205, %y 77): 

An interesting special case is $\Sigma_N = GG^\top$, with $G$ a column vector ($\Sigma_N$ has rank 1). For any
initial condition $x$, the probability measure $P^k(x, \cdot)$ does not have a density for $k < n$. The
proposition tells us that there is a density for $k \ge n$, provided $\mathcal{C}$ has rank $n$.
There is no density for any $k$ when $\mathcal{C}$ is rank deficient. Moreover, since $\Sigma_{X_\infty}$ is no longer
invertible, the limits (ii) and (iii) are no longer valid. Exercise 6.10 is intended to explain how the
proposition must be modified when $\Sigma_{X_\infty}$ is not full rank.

Recall from Fig. 2.7 a comparison of the deterministic and stochastic models. The linear 
Gaussian model was chosen so that the assumptions of Prop. 6.1 are satisfied: the eigenvalues 
of F lie in the open unit disk in C, and the rank condition (6.13) holds. The covariance for the 
disturbance was chosen to be rank-deficient. This was through the choice $N(k) = G W(k)$ with
$G \in \mathbb{R}^2$ and $W(k)$ scalar $N(0, 1)$, resulting in $\Sigma_N = GG^\top$. The rank condition (6.13) is

$$ \operatorname{rank}([\,G \mid FG\,]) = 2 $$

Equivalently, $G$ is not an eigenvector for $F$. □
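
The rank condition (6.13) and the steady-state covariance of Prop. 6.1 are easy to check numerically. The sketch below assumes NumPy; the matrices F and G are hypothetical (they are not the ones behind Fig. 2.7), chosen so that the disturbance covariance has rank 1 and G is not an eigenvector of F. The infinite series is truncated at 500 terms, which is an approximation.

```python
import numpy as np

# Hypothetical model data (assumptions for illustration only).
F = np.array([[0.5, 0.2],
              [0.0, 0.8]])      # eigenvalues 0.5 and 0.8, inside the unit disk
G = np.array([[1.0],
              [1.0]])           # column vector, so Sigma_N = G G^T has rank 1
Sigma_N = G @ G.T
n = F.shape[0]

# Controllability matrix C = [G | FG | ... | F^{n-1} G] from (6.13).
C = np.hstack([np.linalg.matrix_power(F, i) @ G for i in range(n)])
rank_C = np.linalg.matrix_rank(C)

# Steady-state covariance: truncated series sum_i F^i Sigma_N (F^i)^T.
Sigma = np.zeros((n, n))
Fi = np.eye(n)
for _ in range(500):
    Sigma += Fi @ Sigma_N @ Fi.T
    Fi = F @ Fi
```

The limit of this series solves the fixed-point relation Sigma = F Sigma F^T + Sigma_N, which gives a convenient way to validate the truncation.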


The Gaussian assumption is imposed only so that we can identify the invariant density. There is 
an ergodic steady state and some form of ergodicity, whenever $|\lambda(F)| < 1$ for every eigenvalue of $F$,
regardless of the marginal distribution for the disturbance. In other words, the distribution of the 
noise is hardly relevant, so it is sufficient to consider the disturbance-free model with state process 
x to determine stability of this Markov chain (in the sense of the existence of a steady-state). We 
will see in Chapter 7 that generalizations of this conclusion hold for more general nonlinear models, 
and that one explanation for the solidarity of deterministic and stochastic models is found through 
Lyapunov theory. 

Random walks are defined by taking successive sums of independent and identically distributed 
(i.i.d.) random variables. 


Example 6.2.2. Random Walks 
Suppose that $\boldsymbol{X} = \{X(k) : k \ge 0\}$ is a sequence of random variables defined by,

$$ X(k+1) = X(k) + N(k+1), \qquad k \ge 0 $$

where $X(0) \in \mathbb{R}$ is independent of $\boldsymbol{N}$, and the sequence $\boldsymbol{N}$ is i.i.d., taking values in $\mathbb{R}$. Then $\boldsymbol{X}$ is
called a random walk on $\mathbb{R}$. The random walk is a special case of the one-dimensional linear state
space model in which $F = 1$.

Suppose that the stochastic process $\boldsymbol{X}$ is defined by the recursion,

$$ X(k+1) = [X(k) + N(k+1)]_+ \mathrel{:=} \max(0,\ X(k) + N(k+1)), \qquad k \ge 0, $$

where again $X(0) \in \mathbb{R}$, and $\boldsymbol{N}$ is an i.i.d. sequence of random variables taking values in $\mathbb{R}$. Then
$\boldsymbol{X}$ is called the reflected random walk. The reflected random walk is a special case of the
one-dimensional nonlinear state space model in which $\mathrm{F}(x, d) = [x + d]_+$ for each $x, d \in \mathbb{R}$. □
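
The reflected random walk recursion can be simulated in a few lines. This is a minimal sketch: the function name `reflected_random_walk` and the Gaussian disturbance with mean -0.1 are illustrative assumptions, not choices made in the text.

```python
import random


def reflected_random_walk(x0, draw_noise, steps, seed=0):
    """Sample path of X(k+1) = max(0, X(k) + N(k+1)) with i.i.d. disturbance N."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(steps):
        # Reflection at the origin: the state is never allowed to go negative.
        path.append(max(0.0, path[-1] + draw_noise(rng)))
    return path


# Hypothetical disturbance: Gaussian with negative mean, so the walk drifts toward zero.
path = reflected_random_walk(0.0, lambda rng: rng.gauss(-0.1, 1.0), steps=10_000)
```

Dropping the `max(0.0, ...)` reflection recovers the ordinary random walk of the first part of the example.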


The reflected random walk is a model for storage systems and queueing systems. For all such 
applications there are similar concerns: “we need to know whether a dam overflows, whether a queue 
ever empties, whether a computer network jams” [257]. We are also interested in finer questions 
regarding performance, such as the mean and variance of delay. 

One of the simplest reflected random walks is also the most famous model for a queue. 


Example 6.2.3. The M/M/1 queue
The transition function for the M/M/1 queue is defined as

$$ \mathsf{P}(X(k+1) = y \mid X(k) = x) = P(x, y) = \begin{cases} \alpha & \text{if } y = x + 1 \\ \mu & \text{if } y = (x - 1)_+ \end{cases} \tag{6.14} $$

where $\alpha$ denotes the arrival rate to the queue, $\mu$ is the service rate, and these parameters are
normalized so that $\alpha + \mu = 1$.


Figure 6.1: The M/M/1 queue: In the stable case on the left we see that the process X(k) appears piecewise linear,
with a relatively small high frequency ‘disturbance’. The process explodes linearly in the unstable case shown on the
right.

Figure 6.2: A close-up of the trajectory shown on the left hand side of Fig. 6.1 with load ρ = 0.9 < 1. After a
transient period, the queue length oscillates around its steady-state mean of 9.


The parameter $\rho = \alpha/\mu$ is known as the load for the queue. If $\rho < 1$ then the arrival rate
is strictly less than the service rate. In this case the process is ergodic: there is a pmf $\pi$ on the
non-negative integers such that for any initial queue length $X(0) = x$, and any integer $m \ge 0$,

$$ \lim_{k \to \infty} \mathsf{P}_x\{X(k) = m\} = \pi(m) $$

The invariant pmf is geometric with parameter $\rho$, so that $\pi(m) = (1 - \rho)\rho^m$. The existence of $\pi$ is
interpreted as a form of stability for the queueing model, so that the sample path behavior looks
like that shown in the left hand side of Fig. 6.1 and in Fig. 6.2. □
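
The claim that the geometric pmf is invariant for the M/M/1 queue can be verified directly from the transition law. The sketch below uses the load 0.9 from Fig. 6.2; the variable names `arrival` and `service` are illustrative stand-ins for the arrival and service rates, and the check is restricted to the first 200 states.

```python
rho = 0.9                     # load, as in Fig. 6.2
service = 1.0 / (1.0 + rho)   # service rate
arrival = rho * service       # arrival rate, so that arrival + service = 1


def pi(m):
    """Geometric pmf pi(m) = (1 - rho) rho^m on the non-negative integers."""
    return (1.0 - rho) * rho ** m


def pi_P(y):
    """(pi P)(y) = sum_x pi(x) P(x, y) for the M/M/1 transition law."""
    if y == 0:
        # State 0 is reached by a service completion from x = 0 or x = 1.
        return pi(0) * service + pi(1) * service
    # State y >= 1 is reached by an arrival from y - 1 or a service from y + 1.
    return pi(y - 1) * arrival + pi(y + 1) * service


max_gap = max(abs(pi_P(y) - pi(y)) for y in range(200))
mean_queue = sum(m * pi(m) for m in range(5000))   # truncated steady-state mean
```

The truncated mean agrees with rho / (1 - rho) = 9, the steady-state mean quoted in the caption of Fig. 6.2.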


6.3 Spectra and Ergodicity


This section is devoted to the finite state space model: $\mathsf{X}$ is finite with $m \ge 2$ elements. In this case
$P$ is viewed as an $m \times m$ matrix, whose eigenvalues are the solutions to the characteristic equation:

$$ \det[\lambda I - P] = 0 \tag{6.15} $$

There are $m$ solutions denoted $\{\lambda_1, \dots, \lambda_m\}$ (some may be repeated). By convention we set $\lambda_1 = 1$.
Some immediate properties of eigenvalues:


(i) Why is $\lambda_1 = 1$ an eigenvalue? Because there is always an eigenvector: define $v^1 \in \mathbb{R}^m$ to be
the vector whose entries are all equal to one. Following the conventions above, the vector is
also viewed as a function on $\mathsf{X}$. The matrix-vector product $Pv^1$ is obtained using the usual
definition:

$$ Pv^1(x) = \sum_y P(x, y)\, v^1(y) = \sum_y P(x, y) = 1 $$

That is, $Pv^1 = v^1$.


(ii) Less obvious is that there is a left eigenvector $\pi$ with eigenvalue $\lambda_1 = 1$ that has non-negative
entries. This is normalized so that $\sum_x \pi(x) = 1$. The eigenvector property is

$$ \pi(y) = \sum_x \pi(x)\, P(x, y) $$

The pmf $\pi$ is called invariant.


(iii) Every eigenvalue must satisfy $|\lambda| \le 1$ (that is, $\lambda$ lies in the closed unit disk in the complex
plane). To see this, consider iterating the equation $Pv = \lambda v$ to obtain

$$ P^n v = \lambda^n v, \qquad n \ge 1 $$

Remember that the left hand side is a conditional expectation, so that

$$ \mathsf{E}[v(X(n)) \mid X(0) = x] = \lambda^n v(x) $$

The left hand side of this equation is bounded in $n$, which means that $|\lambda| \le 1$ as claimed. □


The Markov chain is called ergodic if for each $x, y \in \mathsf{X}$,

$$ \lim_{n \to \infty} \mathsf{P}\{X(n) = y \mid X(0) = x\} = \lim_{n \to \infty} P^n(x, y) = \pi(y) \tag{6.16} $$

Consequently, for any function $c \colon \mathsf{X} \to \mathbb{C}$,

$$ \lim_{n \to \infty} \mathsf{E}[c(X(n)) \mid X(0) = x] = \pi(c) = \sum_y c(y)\, \pi(y) \tag{6.17} $$

This leads to one more simple observation:


(iv) Suppose there is an eigenvalue $\lambda \neq 1$ but satisfying $|\lambda| = 1$ (the eigenvalue lies on the
boundary of the unit disk). Then the Markov chain cannot be ergodic. To see this, let $v \in \mathbb{C}^m$
be an eigenvector (non-zero), and write $\lambda = e^{j\theta}$ with $0 < \theta < 2\pi$. Hence $P^n v = \lambda^n v = e^{jn\theta} v$
for each $n$. With $c = v$ we obtain

$$ \mathsf{E}[c(X(n)) \mid X(0) = x] = \sum_{y \in \mathsf{X}} P^n(x, y)\, v(y) = P^n v\,(x) $$

The right hand side does not converge as $n \to \infty$.
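
Observation (iv) is illustrated by the simplest periodic chain. The sketch below assumes NumPy; the two-state "flip" chain is a hypothetical example, not one taken from the text. Its eigenvalues are 1 and -1, and the powers of P alternate forever instead of converging.

```python
import numpy as np

# Hypothetical periodic chain: the state flips deterministically at every step.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

eigvals = np.linalg.eigvals(P)          # eigenvalues 1 and -1, both on the unit circle
P_even = np.linalg.matrix_power(P, 10)  # equals the identity
P_odd = np.linalg.matrix_power(P, 11)   # equals P again
```

The pmf (1/2, 1/2) is invariant for this chain, so an invariant pmf can exist even though the limit (6.16) does not.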


The following may be expected from the foregoing.

Theorem 6.2. (Spectral conditions for ergodicity) Suppose that $\lambda_1 = 1$ is the only eigenvalue
satisfying $|\lambda| = 1$, and this eigenvalue is not repeated. Then the chain is ergodic, and the
convergence rate in (6.16) is geometric:

$$ \lim_{n \to \infty} \frac{1}{n} \log\Bigl( \max_{x, y} |P^n(x, y) - \pi(y)| \Bigr) = \log(\rho) < 0 \tag{6.18} $$

where $\rho = \max\{|\lambda_k| : k \ge 2\}$.


Proof. The first step is to consider a modified matrix $\widetilde{P}$ defined by

$$ \widetilde{P}(x, y) = P(x, y) - \pi(y), \qquad x, y \in \mathsf{X}. \tag{6.19} $$

This can be expressed $\widetilde{P} = P - 1 \otimes \pi$, where $1 = v^1$ is a column vector of ones, $\pi$ is the invariant
pmf, and “$\otimes$” is an outer product. It can be shown by induction that

$$ \widetilde{P}^n = P^n - 1 \otimes \pi \tag{6.20} $$

That is, for each $x, y$,

$$ \widetilde{P}^n(x, y) = P^n(x, y) - \pi(y) $$

With a bit more effort it can be shown that $\lambda(\widetilde{P}) = \{0, \lambda_2, \dots, \lambda_m\}$. That is, all of the eigenvalues
of $\widetilde{P}$ coincide with those of $P$, except the first eigenvalue which is moved to the origin. A bit of
linear algebra completes the proof of (6.18). □


Figure 6.3: Communication diagram and eigenvalues for a four-state Markov chain.


Example Matlab’s mcmix command is a convenient way to randomly generate transition matrices 
with given structure. Here is one example with m = 4 states, in which the five zeros in the transition 
matrix were imposed: 


$$ P = \begin{bmatrix} 0.0500 & 0 & 0.5536 & 0.3964 \\ 0.9094 & 0.0500 & 0 & 0.0406 \\ 0.1519 & 0 & 0.8481 & 0 \\ 0 & 0.1891 & 0.4302 & 0.3807 \end{bmatrix} $$


The plot on the left in Fig. 6.3 shows the communication diagram for this Markov chain. This is a
directed graph in which nodes correspond to the four states, and there is a directed edge from
state $x$ to state $y$ if $P(x, y) > 0$. It is a generalization of the graph model for deterministic control
systems introduced in Exercise 3.2.

The eigenvalues of $P$ are $\{\lambda_1, \dots, \lambda_4\} = \{1,\ 0.5044,\ -0.0878 \pm 0.3295j\}$, and are illustrated at the
right in Fig. 6.3. The spectral gap is defined by $1 - \max\{|\lambda_i| : i \ge 2\} = 1 - |\lambda_2|$. The plots were
obtained using Matlab commands graphplot and eigplot.

Matlab can be used to find the eigenvectors of the transpose of P: 


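[V, L] = eig(P')


which gives the diagonal matrix $L = \operatorname{diag}(\lambda_1, \dots, \lambda_4)$, and the four columns of $V$ are the corresponding
eigenvectors of $P^\top$. Normalizing the eigenvector for $\lambda_1 = 1$ so that its entries sum to one gives the
invariant pmf, $\pi = [0.1378,\ 0.0178,\ 0.7551,\ 0.0893]$.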
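
For readers without Matlab, the same computation can be sketched with NumPy (an assumption about the reader's toolchain, not something used in the text). The matrix entries are copied from the example above; `numpy.linalg.eig` plays the role of `eig`, and since it does not order the eigenvalues, the one closest to 1 is located explicitly.

```python
import numpy as np

# Transition matrix of the four-state example.
P = np.array([[0.0500, 0.0,    0.5536, 0.3964],
              [0.9094, 0.0500, 0.0,    0.0406],
              [0.1519, 0.0,    0.8481, 0.0   ],
              [0.0,    0.1891, 0.4302, 0.3807]])

# Right eigenvectors of P^T are left eigenvectors of P.
eigvals, V = np.linalg.eig(P.T)

# Pick the eigenvector for the eigenvalue closest to 1, then normalize to a pmf.
i = int(np.argmin(np.abs(eigvals - 1.0)))
pi = np.real(V[:, i])
pi = pi / pi.sum()

# Spectral gap: one minus the second-largest eigenvalue modulus.
moduli = np.sort(np.abs(eigvals))[::-1]
spectral_gap = 1.0 - moduli[1]
```

Dividing by `pi.sum()` also removes the arbitrary sign of the eigenvector returned by the solver.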


Figure 6.4: Rate of convergence of $P^n$ to $1 \otimes \pi$ for the four-state Markov chain.


Thm. 6.2 states that $P^n(x, y) \to \pi(y)$ at rate approximately $\lambda_2^n = \rho^n$ for all $x, y$. Shown
in Fig. 6.4 is a plot of $\log(|P^n(x, y) - P^n(y, y)|)$ for $y = 4$ and $x \neq 4$, along with a plot of
$\log(\lambda_2^n) = n \log(\rho)$. Observe that the difference $\log|P^n(x, y) - P^n(y, y)| - n\log(\rho)$ is bounded in $n$
(which is more than anticipated from (6.18)). This bound holds in this example because there is a
single eigenvalue satisfying $|\lambda_2| = \rho$.
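
The geometric rate in (6.18) can also be observed numerically for the four-state example. The sketch below assumes NumPy and departs slightly from Fig. 6.4: it tracks the maximal error over all pairs (x, y), approximates the invariant pmf by a row of a high power of P, and uses the horizons 30 and 31, all of which are illustrative choices.

```python
import numpy as np

P = np.array([[0.0500, 0.0,    0.5536, 0.3964],
              [0.9094, 0.0500, 0.0,    0.0406],
              [0.1519, 0.0,    0.8481, 0.0   ],
              [0.0,    0.1891, 0.4302, 0.3807]])

# Every row of P^n approaches pi, so a high power gives an accurate approximation.
pi = np.linalg.matrix_power(P, 200)[0]


def err(n):
    """max over x, y of |P^n(x, y) - pi(y)|; broadcasting subtracts pi from each row."""
    return np.max(np.abs(np.linalg.matrix_power(P, n) - pi))


# Successive errors shrink by a factor close to rho, the second-largest eigenvalue modulus.
ratio = err(31) / err(30)
rho = np.sort(np.abs(np.linalg.eigvals(P)))[-2]
```

Comparing `ratio` with `rho` gives a quick empirical check of Thm. 6.2 without plotting.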


6.4 A Random Glance Ahead 


For the linear state space model, the M/M/1 queue, and for finite state space Markov chains, we
identified conditions under which there is a limiting probability measure $\pi$ satisfying (6.6): for each
initial condition $x \in \mathsf{X}$ and $S \in \mathcal{B}(\mathsf{X})$,

$$ \lim_{k \to \infty} P^k(x, S) = \pi(S) $$

The limit $\pi$ is invariant, in the sense that

$$ \int_{\mathsf{X}} \pi(dx)\, P^k(x, S) = \pi(S), \qquad S \in \mathcal{B}(\mathsf{X}),\ k \ge 1 $$

That is, $\pi P^k = \pi$. Conditions for the existence of an invariant measure on an infinite state space
will be surveyed later in the chapter, along with far stronger ergodic theorems.

The existence of an invariant measure is equivalent to the existence of a steady state realization
of the state process $\boldsymbol{X}$. That is, the state process satisfies the stationarity property:

$$ \mathsf{P}\{X(k) \in A_0,\ X(k+1) \in A_1,\ \dots,\ X(k+m) \in A_m\} \quad \text{is independent of } k \ge 0, $$

for any $m \ge 0$ and any sequence of sets $\{A_0, \dots, A_m\}$, each in $\mathcal{B}(\mathsf{X})$. Naturally, the stationary
realization satisfies $X(k) \sim \pi$ for each $k$. While we don’t expect to encounter true stationarity in
the real world, this idealization is helpful for conceptualizing and analyzing algorithms.


Critic methods The term “critic” in RL refers to a value function $h$, which is defined with
respect to a one-step cost function $c$. One example is defined by total discounted-cost: for given
$0 \le \gamma < 1$,

$$ h(x) = \mathsf{E}_x\Bigl[ \sum_{k=0}^{\infty} \gamma^k c(X(k)) \Bigr] $$

TD(λ) learning was introduced in the 1970s to obtain an approximate solution to the discounted-
cost dynamic programming equation, based on a generalization of the temporal difference that is
the core of TD-learning algorithms found in Part I. The stochastic setting gives new tools and
insights. For example, it will be seen that the final output of the TD(1) algorithm solves the
minimum norm problem:

$$ \min\{ \|\hat{h} - h\|^2 : \hat{h} \in \mathcal{H} \} $$

where $\mathcal{H}$ is the function class within which we seek an approximation, and the norm is defined with
respect to the steady-state:

$$ \|\hat{h} - h\|^2 = \mathsf{E}\bigl[ (\hat{h}(X(k)) - h(X(k)))^2 \bigr] \quad \text{when } X(k) \sim \pi \tag{6.22} $$


Actor methods One formulation of RL begins with a parameterized family $\{P_\theta : \theta \in \mathbb{R}^d\}$, along
with cost functions $\{c_\theta : \theta \in \mathbb{R}^d\}$. The “actor” methods are designed to find the parameter $\theta^* \in \mathbb{R}^d$
that minimizes the average cost $\eta_\theta = \int c_\theta(x)\, \pi_\theta(dx)$, with $\pi_\theta$ invariant for $P_\theta$. See Thm. 6.8 for
one technique to estimate the gradient of $\eta_\theta$, so that $\theta^* \in \mathbb{R}^d$ can be estimated using stochastic
gradient descent. Actor-Critic Methods make use of a combination of Thm. 6.8 and (6.22) to obtain
model-free and unbiased estimates of the gradient.


6.5 Poisson’s Equation 
This little equation will appear in many forms later in the book: 


$$ c + Ph = h + \eta \tag{6.23} $$

It is known as Poisson’s equation. The function $c$ is known as the forcing function, $\eta$ is a constant,
and the solution $h$ is called the relative value function. The conditions imposed to establish a
solution to (6.23) also imply that an invariant measure $\pi$ exists, and that $\eta = \pi(c) = \int c(x)\, \pi(dx)$.
In many cases we obtain a solution by iteration or inversion:

$$ h = \sum_{k=0}^{\infty} P^k \tilde{c} \tag{6.24} $$

with $\tilde{c}(x) = c(x) - \eta$ (one rationale for the name relative value function). The solution to (6.23) is
not unique: if $h$ is a solution, then we obtain a new solution by adding a constant.

The abstract notation in equations (6.23, 6.24) is based on (6.9b). For a finite state space model,
Poisson’s equation becomes

$$ c(x) + \sum_{x'} P(x, x')\, h(x') = h(x) + \eta, \qquad x \in \mathsf{X} \tag{6.25} $$

An equivalent representation is reminiscent of the fixed policy dynamic programming equation
(2.25), based on the Markov model (6.3):

$$ h(x) = \tilde{c}(x) + \mathsf{E}[h(\mathrm{F}(x, N(k+1)))] $$

This implies sample path formulae similar to those discussed in Section 2.5.2 and throughout
Chapter 5:

$$ h(X(k)) = \tilde{c}(X(k)) + \mathsf{E}[h(X(k+1)) \mid X(0), \dots, X(k)] $$

Poisson’s equation plays a starring role in later chapters: 
(i) The relative value function of average-cost optimal control is the solution to a particular 
Poisson equation. 


(ii) Recall that Poisson’s inequality was the engine behind performance bounds in Section 2.4.3
(i.e., bounds on the value function $J$). The analogous inequality is used in Section 6.6 to
obtain bounds on the steady-state mean $\eta$ as well as the relative value function $h$.


(iii) The Central Limit Theorem (CLT) is used to obtain rates of convergence for algorithms in 
nearly every chapter. The covariance matrix appearing in the CLT has a representation in 
terms of the solution to (6.23), where the choice of c depends on the application. The first 
introduction to the CLT is in Section 6.7. 


(iv) As previewed in Section 6.4, in some applications to control we have a family of transition
kernels $\{P_\theta : \theta \in \mathbb{R}^d\}$. With $c \colon \mathsf{X} \to \mathbb{R}$ interpreted as a cost function, we would like to minimize
the steady-state average cost $\eta$ over all $\theta$. Many approximations of gradient descent are based
on Poisson’s equation, and rooted in the sensitivity theory surveyed in Section 6.8.


(v) One application of the main result of Section 6.8 is to Actor-Critic Methods for RL, which 
is the topic of Chapter 10. 


The concepts in Section 6.3 provide conditions for existence of a solution to (6.23) when X 
is finite. The discussion in Section 3.2.3 provides an early introduction to the Perron-Frobenius 
theory behind Thm. 6.3. 


Theorem 6.3. (Spectral conditions for Poisson’s equation) Suppose $\mathsf{X}$ is finite, and the
assumptions of Thm. 6.2 are satisfied. Then the following hold:

(i) The function $h_1 = \sum_{k=0}^{\infty} P^k \tilde{c}$ is a solution to (6.23) (recall (6.24)).

(ii) Let $s \colon \mathsf{X} \to \mathbb{R}_+$ be a function satisfying $\pi(s) > 0$, and $\nu$ a pmf satisfying

$$ P(x, x') \ge s(x)\, \nu(x'), \qquad x, x' \in \mathsf{X} $$

also expressed $P \ge s \otimes \nu$. Then, a solution to Poisson’s equation is given by

$$ h_2 = G_{s,\nu}\, \tilde{c}, \qquad \text{where } G_{s,\nu} = \sum_{n=0}^{\infty} (P - s \otimes \nu)^n = [I - (P - s \otimes \nu)]^{-1} $$

(iii) Let $x^\bullet \in \mathsf{X}$ denote any state for which $\pi(x^\bullet) > 0$. Another solution to Poisson’s equation
is defined by the expectation:

$$ h_3(x) = \mathsf{E}_x\Bigl[ \sum_{k=0}^{\tau_\bullet - 1} \tilde{c}(X(k)) \Bigr] \tag{6.26} $$

where $\tau_\bullet$ is the first return time:

$$ \tau_\bullet = \min\{k \ge 1 : X(k) = x^\bullet\} $$

This solution satisfies $h_3(x^\bullet) = 0$.

(iv) If a function $g$ and constant $\beta$ solve $c + Pg = g + \beta$, then $\beta = \eta = \pi(c)$, and

$$ g(x) - g(x^\bullet) = h_1(x) - h_1(x^\bullet) = h_2(x) - h_2(x^\bullet) = h_3(x) \qquad \text{for each } x \in \mathsf{X} $$

Proof. The simplest of the three is (i), since

$$ P h_1 = \sum_{k=0}^{\infty} P^{k+1} \tilde{c} $$

where we used $P \cdot P^k = P^{k+1}$. The right hand side is equal to $h_1 - \tilde{c}$.
The remainder of the proof is only a “roadmap”. For the full details see the Appendix of [254].
For parts (ii) and (iii), recall that the invariant pmf $\pi$ exists and is unique, due to Thm. 6.2.
Poisson’s equation combined with invariance of $\pi$ implies the following:

$$ \pi(h) \mathrel{:=} \sum_x \pi(x) h(x) = \sum_{x, y} \pi(x) P(x, y) h(y) = \sum_x \pi(x) \{h(x) - c(x) + \eta\} = \pi(h) - \pi(c) + \eta $$

Hence $\eta = \pi(c)$.

The proof that $h_2$ solves Poisson’s equation follows arguments similar to Prop. 3.5. A full proof
can be found in [257] or the Appendix of [254]. This solution is related to $h_1$: we have $P^k \tilde{c} = \widetilde{P}^k \tilde{c}$
for any $k$, where $\widetilde{P} = P - 1 \otimes \pi$ was introduced in (6.19). Hence $h_1 = G_{1,\pi}\, \tilde{c}$.

The third representation can be reduced to the second, starting with

$$ h_3(x) = \sum_{k=0}^{\infty} \mathsf{E}_x\bigl[ \mathbb{1}\{\tau_\bullet > k\}\, \tilde{c}(X(k)) \bigr] $$

It can be shown that this is $h_2$ in the special case $s(x) = \mathbb{1}_{x^\bullet}(x)$, and $\nu(y) = P(x^\bullet, y)$.
We finally come to (iv): if $c + Pg = g + \beta$ and $c + Ph = h + \eta$, then let $\Delta = h - g$, giving

$$ P\Delta = \Delta + \eta - \beta $$

Using invariance of $\pi$ gives

$$ \pi(\Delta) = \sum_x \pi(x) \Delta(x) = \sum_{x, y} \pi(x) P(x, y) \Delta(y) = \sum_x \pi(x) \Delta(x) + \eta - \beta = \pi(\Delta) + \eta - \beta $$

This gives $\eta = \beta$, and also the eigenvector equation

$$ P\Delta = \Delta $$

Under the assumptions of the theorem it follows that $\Delta(x)$ does not depend on $x$. □
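
Part (ii) of Thm. 6.3 with s = 1 and the pmf equal to the invariant pmf gives a direct recipe for a finite chain: solve one linear system. The sketch below assumes NumPy; the three-state transition matrix and the cost vector are hypothetical values, and the invariant pmf is obtained from the standard identity pi (I - P + 1 1^T) = 1^T rather than from any method described in the text.

```python
import numpy as np

# Hypothetical three-state chain with eigenvalues {1, 0.5, 0}, and a hypothetical cost.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
c = np.array([1.0, 0.0, 2.0])
m = P.shape[0]
ones = np.ones(m)

# Invariant pmf: pi (I - P + 1 1^T) = 1^T, solved in transposed form.
pi = np.linalg.solve((np.eye(m) - P + np.outer(ones, ones)).T, ones)

eta = pi @ c          # steady-state mean eta = pi(c)
c_tilde = c - eta     # centered forcing function

# h = [I - (P - 1 (x) pi)]^{-1} c_tilde, the inversion form of (6.24).
h = np.linalg.solve(np.eye(m) - (P - np.outer(ones, pi)), c_tilde)

# Poisson's equation (6.23): c + P h = h + eta.
residual = c + P @ h - (h + eta)
```

This particular solution satisfies pi(h) = 0; subtracting `h[0]` instead produces the solution that vanishes at the first state, in line with the normalization of (6.26).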
6.6 Lyapunov Functions


Much of the Lyapunov stability theory for deterministic dynamical systems can be adapted to the 
Markovian setting. Even for an irreducible Markov chain on a finite state space, for which the term 
“stability” may not be meaningful, Lyapunov functions are useful as a means to obtain performance 
bounds (generalizing the Comparison Theorem in Prop. 2.3). 

Poisson’s inequality for the Markovian model is the following extension of (2.31): for a function
$V \colon \mathsf{X} \to \mathbb{R}_+$, a function $c \colon \mathsf{X} \to \mathbb{R}_+$, and a constant $\bar\eta < \infty$,

$$ \mathsf{E}[V(X(k+1)) \mid X(k) = x] \le V(x) - c(x) + \bar\eta, \qquad x \in \mathsf{X}. $$

In the more compact operator-theoretic notation this becomes

$$ PV \le V - c + \bar\eta \tag{6.27} $$

As in the deterministic case, the function $c$ is usually interpreted as a cost function on the state
space. It is frequently assumed that $c(x)$ is large for “large” $x$ (recall the definition of coercive
from Section 2.4.3). In this case, the Poisson inequality implies that $V(X(k))$ decreases on average
whenever $X(k)$ is large. This is illustrated for a deterministic system in Fig. 2.5, in which the set
referred to in the caption is $S = \{x : c(x) \le \bar\eta\}$.


6.6.1 Average cost 


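Let $\eta(x)$ denote the average cost,

$$ \eta(x) = \limsup_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \mathsf{E}[c(X(k)) \mid X(0) = x] $$

Using the operator-theoretic notation (6.9b) gives

$$ \eta(x) = \limsup_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} P^k c\,(x). $$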


The dependency on x, and the use of “limit supremum” rather than limit in the definition of n(x), 
are both required because we are not imposing any particular structure on X. 

The average cost admits a simple bound under Poisson’s inequality (6.27). This is an extension 
of Prop. 2.3 to the Markovian setting. 


Proposition 6.4. Suppose that (6.27) holds with $V \ge 0$ everywhere. Then, the following transient
bound holds for each $n \ge 1$, and each $x \in \mathsf{X}$:

$$ \frac{1}{n} \sum_{k=0}^{n-1} P^k c\,(x) \le \bar\eta + \frac{1}{n} V(x) $$

Consequently, the average-cost admits the bound $\eta(x) \le \bar\eta$.

Proof. Applying $P$ to both sides of (6.27) gives $P^2 V \le PV - Pc + P\bar\eta$, and since $\bar\eta$ is constant,

$$ P^2 V \le PV - Pc + \bar\eta \le V - c - Pc + 2\bar\eta $$

By repeated multiplication by $P$ we conclude that, for any $n$,

$$ P^n V \le V + n\bar\eta - \sum_{k=0}^{n-1} P^k c $$

This gives the desired result on rearranging terms, and using the assumption that $V \ge 0$. □


We have seen that the average-cost bound given in Prop. 6.4 is tight for a finite state space 
Markov chain, under the assumptions of Thm. 6.3: we can take V = h+const., where the constant 
is chosen large enough so that V > 0. 

The beauty of Lyapunov theory is that we can obtain performance bounds without knowing 
much about the model. This is illustrated in the next example. 


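Example: The scalar linear state space model Consider the scalar model,

$$
X(k+1) = aX(k) + N(k+1), \qquad k \ge 0, \tag{6.28}
$$

where $\boldsymbol{N}$ is i.i.d., with zero mean and finite second moment $\sigma_N^2$ (not necessarily Gaussian). The cost
function is the quadratic, $c(x) = \tfrac12 x^2$.
Let $V(x) = \tfrac12 \kappa x^2$ with $\kappa > 0$. We then have,

$$
\begin{aligned}
PV(x) &= \mathsf{E}[V(X(k+1)) \mid X(k) = x] \\
&= \tfrac12 \kappa\, \mathsf{E}[(ax + N(1))^2] \\
&= V(x) + \tfrac12 \kappa (a^2 - 1) x^2 + \tfrac12 \kappa \sigma_N^2
\end{aligned}
\tag{6.29}
$$

Provided $|a| < 1$, we can set $\kappa = (1-a^2)^{-1}$ in the definition of $V$ to obtain a solution to Poisson’s
equation with forcing function $c$,

$$
PV(x) = V(x) - c(x) + \bar\eta, \qquad \text{with } \bar\eta = \tfrac12 (1-a^2)^{-1} \sigma_N^2 \tag{6.30}
$$

It follows from Prop. 6.4 that $\eta(x) \le \bar\eta$ for each $x$. In fact, the steps above show that
$\mathsf{E}[c(X(k))] \to \bar\eta$ for each initial condition, so that we have equality: $\eta(x) = \bar\eta$.

If $\boldsymbol{N}$ is Gaussian then $P^k(x, \,\cdot\,)$ converges to a Gaussian $N(0,\sigma_\infty^2)$ distribution as $k \to \infty$, with
$\sigma_\infty^2 = (1-a^2)^{-1} \sigma_N^2$.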
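
The transient bound of Prop. 6.4 can be checked directly for this example, since the second moment of $X(k)$ is available in closed form. The following is a minimal sketch; the parameter values $a = 0.9$, $\sigma_N^2 = 1$, $x = 5$ and the function names are my own choices for illustration, not part of the example above.

```python
# Illustrative check of Prop. 6.4 for X(k+1) = a X(k) + N(k+1), c(x) = x^2/2.
# The values a = 0.9, sigma2 = 1, x0 = 5 are arbitrary assumptions.

def second_moment(a, sigma2, x, k):
    """E[X(k)^2 | X(0) = x] for the scalar linear model (|a| < 1)."""
    a2k = a ** (2 * k)
    return a2k * x * x + sigma2 * (1.0 - a2k) / (1.0 - a * a)

def transient_average_cost(a, sigma2, x, n):
    """(1/n) * sum_{k=0}^{n-1} P^k c (x), with c(x) = x^2 / 2."""
    return sum(0.5 * second_moment(a, sigma2, x, k) for k in range(n)) / n

def lyapunov_bound(a, sigma2, x, n):
    """Right-hand side of Prop. 6.4: eta_bar + V(x)/n with kappa = 1/(1 - a^2)."""
    kappa = 1.0 / (1.0 - a * a)
    eta_bar = 0.5 * kappa * sigma2
    V = 0.5 * kappa * x * x
    return eta_bar + V / n

a, sigma2, x0 = 0.9, 1.0, 5.0
for n in (1, 10, 100, 1000):
    lhs = transient_average_cost(a, sigma2, x0, n)
    rhs = lyapunov_bound(a, sigma2, x0, n)
    print(f"n={n:5d}  average cost={lhs:8.4f}  bound={rhs:8.4f}")
```

The gap between the two columns shrinks at rate $1/n$, which is the sense in which the bound is tight for this model.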


6.6.2 Discounted cost 


“In the long run we are all dead.” This quote is attributed to John Maynard Keynes, and is commonly
used to justify the use of discounting in optimal control. I believe this is a mistake in most control
applications. The rush to discount is often based on a false impression: that the average cost
$\eta(x)$ only reflects the cost “at infinity”. Lyapunov bounds and their consequences, such as the
elementary bounds in (6.4), indicate that a policy with good steady-state behavior will also have
good transient behavior. In particular, if the average cost is finite then there is a solution to (6.27),
from which we obtain the transient bound presented in Prop. 6.4.

However, the discounted cost criterion is sometimes convenient because it is easier to analyze,
and I am also forced to cover this performance criterion because it is preferred in many disciplines
(in particular, operations research and economics).

The discounted-cost value function was introduced briefly in Section 6.4: given a discount
parameter $\gamma \in (0,1)$, the discounted cost from initial condition $x$ is defined as the weighted sum,

$$
h_\gamma(x) = \sum_{k=0}^{\infty} \gamma^k \mathsf{E}[c(X(k)) \mid X(0) = x]. \tag{6.31}
$$

Once again this has the operator-theoretic form,

$$
h_\gamma = \sum_{k=0}^{\infty} \gamma^k P^k c \tag{6.32}
$$

and from this we obtain a dynamic programming equation:

$$
h_\gamma = c + \gamma P h_\gamma. \tag{6.33}
$$

If $c$ is non-negative valued then the lower bound $h_\gamma(x) \ge c(x)$ holds, so that the discounted cost
is unbounded whenever this is true of $c$. And, once again, we obtain a bound on $h_\gamma$ under Poisson’s
inequality.


Proposition 6.5. If (6.27) holds with $V \ge 0$ everywhere, then $h_\gamma(x) \le V(x) + \bar\eta(1-\gamma)^{-1}$ for
each $x$, and $\gamma \in (0,1)$.


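Proof. The bound (6.27) gives

$$
\gamma PV \le V - g + \gamma\bar\eta \tag{6.34}
$$

where $g = (1-\gamma)V + \gamma c$: a convex combination of the Lyapunov function and the cost function.
Applying $\gamma P$ to each side gives,

$$
(\gamma P)^2 V \le \gamma PV - \gamma Pg + \gamma^2\bar\eta
$$

and then using (6.34),

$$
(\gamma P)^2 V \le V - g - \gamma Pg + \gamma\bar\eta + \gamma^2\bar\eta
$$

As in the average-cost problem we obtain by induction,

$$
(\gamma P)^n V \le V - \sum_{k=0}^{n-1} \gamma^k P^k g + \bar\eta \sum_{k=0}^{n-1} \gamma^{k+1}
$$

From the definition $g = (1-\gamma)V + \gamma c$ we obtain, on letting $n \to \infty$ and using $(\gamma P)^n V \ge 0$,

$$
\sum_{k=0}^{\infty} \bigl[ \gamma^{k+1} P^k c\,(x) + (1-\gamma) \gamma^k P^k V(x) \bigr] \le V(x) + \frac{\gamma}{1-\gamma}\,\bar\eta \tag{6.35}
$$

Using the fact that $V \ge 0$ we can drop all but one term involving $V$ on the left hand side of
(6.35) to obtain,

$$
(1-\gamma) V(x) + \gamma \sum_{k=0}^{\infty} \gamma^k P^k c\,(x) \le V(x) + \frac{\gamma}{1-\gamma}\,\bar\eta.
$$

The bound $h_\gamma \le V + \bar\eta(1-\gamma)^{-1}$ follows on subtracting $(1-\gamma)V(x)$ from each side, and then
dividing each side of the resulting inequality by $\gamma$. □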
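Example 6.6.1. Discounted cost for the scalar linear state space model


With cost function $c(x) = \tfrac12 x^2$, the solution to Poisson’s equation was obtained in (6.30). Prop. 6.5
gives the bound $h_\gamma \le V + \bar\eta(1-\gamma)^{-1}$, which in this case becomes

$$
h_\gamma(x) \le \frac12\, \frac{1}{1-a^2} \Bigl[ x^2 + \frac{1}{1-\gamma}\, \sigma_N^2 \Bigr] \tag{6.36}
$$

We next compute $h_\gamma$ to see if this bound is accurate. For this we begin with the dynamic
programming equation (6.33). Let $V(x) = A_\gamma + \tfrac12 \kappa_\gamma x^2$, with $A_\gamma, \kappa_\gamma$ constants to be chosen. From
(6.29) we have,

$$
PV = a^2 V + (1-a^2) A_\gamma + \tfrac12 \kappa_\gamma \sigma_N^2
$$

Scaling by $\gamma$ and adding $c$ to each side gives,

$$
c + \gamma PV = c + \gamma \bigl( a^2 V + (1-a^2) A_\gamma + \tfrac12 \kappa_\gamma \sigma_N^2 \bigr)
$$

Or, re-introducing the quadratic expressions,

$$
c(x) + \gamma PV(x) = \tfrac12 (1 + \gamma a^2 \kappa_\gamma) x^2 + \gamma \bigl( \tfrac12 \kappa_\gamma \sigma_N^2 + A_\gamma \bigr)
$$

To solve the dynamic programming equation we require that the right hand side coincide with $V$.
This requires that we match coefficients: $1 + \gamma a^2 \kappa_\gamma = \kappa_\gamma$ and $\gamma(\tfrac12 \kappa_\gamma \sigma_N^2 + A_\gamma) = A_\gamma$, giving

$$
h_\gamma(x) = A_\gamma + \tfrac12 \kappa_\gamma x^2 = \frac12\, \frac{1}{1-\gamma a^2} \Bigl[ x^2 + \frac{\gamma}{1-\gamma}\, \sigma_N^2 \Bigr]
$$

In particular, the bound (6.36) does hold.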
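
To see how loose the bound (6.36) is, one can evaluate both sides numerically. This is only a sketch: the function names and the parameter values are assumptions made for illustration, and the residual check simply confirms that the closed-form expression satisfies (6.33).

```python
# Sketch: closed-form discounted cost for X(k+1) = a X(k) + N(k+1), c(x) = x^2/2,
# compared with the Lyapunov bound (6.36). Parameter values are arbitrary assumptions.

def discounted_cost(x, a, sigma2, gamma):
    """h_gamma(x) = A_gamma + kappa_gamma x^2 / 2 from matching coefficients."""
    kappa = 1.0 / (1.0 - gamma * a * a)
    A = 0.5 * gamma * kappa * sigma2 / (1.0 - gamma)
    return A + 0.5 * kappa * x * x

def lyapunov_bound(x, a, sigma2, gamma):
    """Right-hand side of (6.36): V(x) + eta_bar / (1 - gamma)."""
    kappa = 1.0 / (1.0 - a * a)
    return 0.5 * kappa * (x * x + sigma2 / (1.0 - gamma))

def bellman_residual(x, a, sigma2, gamma):
    """c(x) + gamma * P h_gamma (x) - h_gamma(x); zero if (6.33) holds."""
    kappa = 1.0 / (1.0 - gamma * a * a)
    A = 0.5 * gamma * kappa * sigma2 / (1.0 - gamma)
    # P h(x) = A + kappa * E[(a x + N)^2] / 2 = A + kappa * (a^2 x^2 + sigma2) / 2
    Ph = A + 0.5 * kappa * (a * a * x * x + sigma2)
    return 0.5 * x * x + gamma * Ph - discounted_cost(x, a, sigma2, gamma)

a, sigma2 = 0.9, 1.0
for gamma in (0.5, 0.9, 0.99):
    for x in (0.0, 1.0, 10.0):
        h = discounted_cost(x, a, sigma2, gamma)
        b = lyapunov_bound(x, a, sigma2, gamma)
        print(f"gamma={gamma:4.2f} x={x:5.1f}  h={h:10.3f}  bound={b:10.3f}  ratio={h / b:5.3f}")
```

The ratio approaches one as $\gamma \uparrow 1$, and is small when $\gamma$ is far from one and $|a|$ is close to one.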


6.7 Simulation: Confidence Bounds & Control Variates 


In average-cost optimal control we are faced not with a single Markov chain, but an entire family:
one Markov chain for each policy in a family of policies. We would like to estimate the average cost
$\eta = \pi(c)$ for many different policies so we can select the best in the family. Lyapunov functions can
provide bounds, but it is rare to obtain bounds that are sufficiently tight so that we can determine
which policy is optimal. Without alternatives, in the majority of cases we resort to simulation.


6.7.1 Asymptotic statistics made finite 


We are blessed by the fact that the Law of Large Numbers (LLN) holds whenever there is an
invariant probability measure $\pi$. The Monte-Carlo estimate based on $n$ samples is denoted

$$
\eta_n = \frac{1}{n} \sum_{k=0}^{n-1} c(X(k)) \tag{6.37}
$$

The LLN tells us that this is asymptotically consistent:

$$
\lim_{n\to\infty} \eta_n = \eta \tag{6.38}
$$

where the limit holds with probability one, for almost every (with respect to $\pi$) initial condition
$X(0)$. In most cases the limit holds for every initial condition.

The next question is the rate of convergence. For this it is most common to turn to the Central
Limit Theorem, which tells us that

$$
\lim_{n\to\infty} \mathsf{P}\{ |\eta_n - \eta| \ge r/\sqrt{n} \} = \mathsf{P}\{ \sigma_{\text{CLT}} |W| \ge r \}, \qquad r > 0 \tag{6.39}
$$

where $W$ is a standard Gaussian random variable ($W \sim N(0,1)$). The value $\sigma^2_{\text{CLT}}$ is called the
asymptotic variance, and has several equivalent forms (subject to assumptions):

$$
\begin{aligned}
\sigma^2_{\text{CLT}} &= \lim_{n\to\infty} n\, \mathsf{E}[(\eta_n - \eta)^2] && \text{(6.40a)} \\
&= \sum_{k=-\infty}^{\infty} R(k) && \text{(6.40b)} \\
&= 2\pi(\tilde c h) - \pi(\tilde c^2) && \text{(6.40c)}
\end{aligned}
$$

where $\tilde c(x) = c(x) - \eta$ for any $x$, and for $k \ge 0$,

$$
R(k) = \pi(\tilde c P^k \tilde c) = \mathsf{E}_\pi[\tilde c(X(0)) \tilde c(X(k))], \quad \text{and} \quad R(-k) = R(k)
$$

This is the autocorrelation sequence for $\tilde c(X(k))$ for the stationary version of $\boldsymbol{X}$. The function $h$
in (6.40c) is the solution to Poisson’s equation (6.23) [257].

In many control applications, the random variables $\{X(i), X(i+k)\}$ are positively correlated
(at least for small $k$), which means that $R(k) > 0$. This is true for the M/M/1 queue (6.14), for
example, and can be anticipated from the skip-free property of this Markov chain. For this reason,
we usually expect $\sigma^2_{\text{CLT}}$ to be much greater than the ordinary variance, which is $R(0)$.

To apply the LLN and CLT for performance approximation it is necessary to estimate the
asymptotic variance. Unfortunately, none of the three expressions above are useful for anything
but analysis, or to inspire algorithms. The representation (6.40a) is inspiration for the batch means
method:


Batch Means Method. Perform $M$ independent runs, each based on $N$ observations, to
obtain $M$ estimates,

$$
\eta_N^i = \frac{1}{N} \sum_{k=0}^{N-1} c(X^i(k)), \qquad 1 \le i \le M \tag{6.41a}
$$

Estimates of the mean and asymptotic variance are then defined by

$$
\bar\eta_N = \frac{1}{M} \sum_{i=1}^{M} \eta_N^i, \qquad \hat\sigma^2_{\text{CLT}} = N\, \frac{1}{M} \sum_{i=1}^{M} (\eta_N^i - \bar\eta_N)^2 \tag{6.41b}
$$

See [133] for potentially more efficient estimators.

The estimates in (6.41b) are used to obtain an approximate confidence bound: for given $\delta > 0$,
choose $r > 0$ so that $\mathsf{P}\{ \hat\sigma_{\text{CLT}} |W| > r \} = \delta$. Then, for large $N$, we have confidence that $\eta$ lies in the
interval $[\bar\eta_N - r/\sqrt{N},\ \bar\eta_N + r/\sqrt{N}]$, with probability approximately $1 - \delta$. If $r/\sqrt{N}$ is not very small,
then the run length $N$ should be increased.
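
The batch means recipe is short enough to sketch in code. The example below is an illustration under my own assumptions: the scalar linear model (6.28) with Gaussian disturbance, cost $c(x) = x$, and parameter values chosen so that the exact asymptotic variance $\sigma_N^2/(1-a)^2$ is available for comparison. The function names are hypothetical.

```python
import math
import random

def run_average(a, sigma, N, rng):
    """One run: eta_N = (1/N) sum_{k<N} c(X(k)) with c(x) = x, X(0) = 0,
    for the scalar model X(k+1) = a X(k) + N(k+1) with Gaussian disturbance."""
    x, total = 0.0, 0.0
    for _ in range(N):
        total += x
        x = a * x + rng.gauss(0.0, sigma)
    return total / N

def batch_means(a, sigma, M, N, seed=0):
    """Return (mean estimate, asymptotic variance estimate) as in (6.41b)."""
    rng = random.Random(seed)
    estimates = [run_average(a, sigma, N, rng) for _ in range(M)]
    eta_bar = sum(estimates) / M
    var_clt = N * sum((e - eta_bar) ** 2 for e in estimates) / M
    return eta_bar, var_clt

a, sigma, M, N = 0.5, 1.0, 200, 2000
eta_bar, var_clt = batch_means(a, sigma, M, N)
# For this linear model with c(x) = x the asymptotic variance is sigma^2 / (1 - a)^2.
print("mean estimate:", eta_bar)
print("estimated asymptotic variance:", var_clt, " exact:", sigma**2 / (1 - a) ** 2)
# Half-width r / sqrt(N) for delta = 0.05, using P{|W| > 1.96} ~ 0.05.
print("approximate 95% half-width for one run:", 1.96 * math.sqrt(var_clt / N))
```

All runs here start from $X(0) = 0$, which is harmless for this rapidly mixing model; the discussion of histograms that follows explains why that choice can be misleading in general.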
Figure 6.5: Histogram of M = 10° independent estimates of θ(15), with time horizon N = 10° for the Watkins
algorithm applied to the 6-state example with discount factor γ = 0.8. In each of the M experiments, the algorithm
was initialized with θ_0 = 0.


Care with histograms The procedure (6.41b) to estimate the asymptotic variance is only effective
if the variance is finite, and the initial conditions in the $M$ independent trials are spaced
widely apart.

Fig. 6.5 is an illustration of what can go wrong. It is based on an example introduced in 
Section 9.7.2 for an RL algorithm rather than Monte-Carlo, but the procedure to estimate variance 
remains the same. 

With the time horizon of N = 10° samples, one might expect that the algorithm has converged.
Based on the histogram, many would believe that the estimates will converge to a value
no greater than 100. The actual limit is nearly 500. In conclusion, this experiment has no value 
for understanding how many iterations are required for an accurate estimate. 

In this particular example it is known that the limit is positive, so it would make sense to 
sample the initial parameter uniformly on a widely spaced interval of the form [0, 7]. 

Section 6.7.4 contains an example for which the CLT is highly predictive, and other examples 
will follow when we come to RL (recall the discussion surrounding Fig. 1.3, or look ahead to 
Section 8.3 and Section 9.8.2). 


6.7.2 Asymptotic variance and mixing time


The “mixing time” is informally defined as the number of iterations required for the Markov chain
to approximately reach its steady state distribution. This might be formalized by introducing
$\varepsilon > 0$ to quantify “approximately”, and letting $T(\varepsilon) > 0$ denote the minimal time for which

$$
|P^n(x,A) - \pi(A)| \le \varepsilon, \qquad \text{for all } n \ge T(\varepsilon),\ A \subset \mathsf{X}
$$

This implies the following bound for functions $f\colon \mathsf{X} \to \mathbb{R}$ satisfying $|f(x)| \le 1$ for all $x$:

$$
|\mathsf{E}_x[f(X(n))] - \pi(f)| \le 2\varepsilon, \qquad \text{for all } n \ge T(\varepsilon)
$$

This follows on observing that a function $f$ maximizing the left hand side can be taken of the
form $f^*(x) = 2\,\mathbb{1}_A(x) - 1$ for some $A \in \mathcal{B}(\mathsf{X})$. In the finite state space setting we can take
$A = \{y : P^n(x,y) \ge \pi(y)\}$.

Remaining in the finite state space setting, it is immediate from Thm. 6.2 and the surrounding
discussion that the mixing time will be very large if there is an eigenvalue $\lambda$ of $P$ satisfying $\lambda \neq 1$
but $|\lambda| \approx 1$. In particular, if $\lambda$ is on the unit circle, so that $|\lambda| = 1$, then the mixing time is infinite.
An example of this is a two state Markov chain with $\mathsf{X} = \{1,2\}$ and transition matrix $P(i,j) = 1$
if $i \neq j$:

$$
P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
$$

The eigenvalues of $P$ are $\lambda_1 = 1$ and $\lambda_2 = -1$. The Markov chain deterministically cycles between
the two states; the second eigenvalue $\lambda_2 = e^{2\pi j/T}$ with $T = 2$ reflects the period-2 dynamics. Hence
the mixing time is infinite. What does this say about simulation?

It is often taken for granted that a large mixing time implies that Monte-Carlo methods will
take a long time to converge. This is clearly false in the two state example, since for any function
$c\colon \mathsf{X} \to \mathbb{R}$,

$$
\eta_n = \frac{1}{n} \sum_{k=0}^{n-1} c(X(k)) =
\begin{cases}
\eta & \text{if $n$ is even} \\[2pt]
\dfrac{n-1}{n}\, \eta + \dfrac{1}{n}\, c(X(n-1)) & \text{else}
\end{cases}
$$

The asymptotic variance is zero since the error $\eta_n - \eta$ converges to zero at rate $1/n$. The fast convergence is
explained by the fact that this is an instance of quasi Monte-Carlo.
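
The $1/n$ rate in the two state example is easy to confirm by direct computation. In this sketch the cost values are arbitrary assumptions, and the function name `eta_n` is my own.

```python
def eta_n(c, x0, n):
    """Monte-Carlo average (1/n) sum_{k<n} c(X(k)) for the chain that
    deterministically alternates between states 1 and 2, started at x0."""
    x, total = x0, 0.0
    for _ in range(n):
        total += c[x]
        x = 3 - x          # 1 -> 2 and 2 -> 1
    return total / n

c = {1: 4.0, 2: -1.0}      # arbitrary cost values (an assumption for illustration)
eta = 0.5 * (c[1] + c[2])  # steady-state mean under pi = (1/2, 1/2)

for n in (1, 2, 9, 10, 999, 1000):
    err = eta_n(c, 1, n) - eta
    print(f"n={n:5d}  error={err:+.6f}  n*|error|={n * abs(err):.3f}")
```

The scaled error $n|\eta_n - \eta|$ is zero for even $n$ and equal to $|c(X(0)) - \eta|$ for odd $n$, so $\sqrt{n}\,(\eta_n - \eta)$ vanishes as $n \to \infty$.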

Even in a highly volatile setting, the mixing time tells us little about the asymptotic variance.
Prop. 6.6 tells us that we can expect a large asymptotic variance for some functions $c$ when $|1 - \lambda_i|$
is close to zero for some $i$, and also reassures us that $|\lambda_i| \approx 1$ is not a problem in general.


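Proposition 6.6. Suppose that $c\colon \mathsf{X} \to \mathbb{R}$ can be expressed as a linear combination of eigenvectors:

$$
c(x) = \eta + \sum_{k=2}^{m} \beta_k v^k(x)
$$

Then,

$$
\sigma^2_{\text{CLT}} = \sum_{x,y \in \mathsf{X}} u(x)\, \Sigma_\Delta(x,y)\, u^*(y)
$$

where $u^*$ denotes complex conjugate transpose, and for each $x, y \in \mathsf{X}$,

$$
\begin{aligned}
u(x) &= \sum_{i=2}^{m} \frac{\beta_i}{1-\lambda_i}\, v^i(x) \\
\Sigma_\Delta(x,y) &= \pi(x) \mathbb{1}\{x = y\} - \sum_{z \in \mathsf{X}} \pi(z) P(z,x) P(z,y)
\end{aligned}
$$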

Proof. Note first that some of the coefficients $\{\beta_i\}$ may be complex if any of the eigenvectors $\{v^i\}$
are complex valued.
Consider for each $x \in \mathsf{X}$ the martingale difference sequence:

$$
\begin{aligned}
\Delta_k(x) &= \mathbb{1}\{X(k) = x\} - \sum_{y} \mathbb{1}\{X(k-1) = y\} P(y,x) \\
&= \mathbb{1}\{X(k) = x\} - \mathsf{P}\{X(k) = x \mid X(0), \dots, X(k-1)\}
\end{aligned}
$$

and let $\Delta_k$ denote the corresponding $m$-dimensional vector sequence. It is not difficult to establish
that $\Sigma_\Delta$ is the steady-state covariance of this sequence. In matrix form we have $\Sigma_\Delta = \Pi - P^{\mathsf{T}} \Pi P$,
where $\Pi = \operatorname{diag}(\pi)$ (the $m \times m$ diagonal matrix with entries $\pi(x)$).

Without any assumption of stationarity, we have for any $i \ge 2$,

$$
\Delta_k^i := \sum_{x} v^i(x) \Delta_k(x) = v^i(X(k)) - \lambda_i v^i(X(k-1))
$$

Averaging each side gives

$$
\frac{1}{n} \sum_{k=1}^{n} \Delta_k^i = \frac{1}{n} \sum_{k=1}^{n} \bigl\{ v^i(X(k)) - \lambda_i v^i(X(k-1)) \bigr\}
$$

Multiplying each side by $\beta_i/(1-\lambda_i)$, summing over $i$, and adding $\eta$ gives

$$
\eta + \frac{1}{n} \sum_{k=1}^{n} \sum_{i=2}^{m} \frac{\beta_i}{1-\lambda_i}\, \Delta_k^i
= \eta_n + \frac{1}{n} \sum_{i=2}^{m} \frac{\beta_i}{1-\lambda_i} \bigl[ v^i(X(n)) - v^i(X(0)) \bigr]
$$

It follows that the asymptotic variance $\sigma^2_{\text{CLT}}$ coincides with the ordinary variance of the martingale
difference sequence $\{\sum_x u(x) \Delta_k(x)\}$ in steady-state. It can be expected to be very large if $|\beta_i/(1-\lambda_i)|$ is large
for one $i$. The vector $\beta$ depends entirely on the function $c$, and the quantity $|1/(1-\lambda_i)|$ is large if
the eigenvalue $\lambda_i$ is near unity. □
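
Prop. 6.6 can be sanity-checked on a two state chain, where everything is available by hand. The sketch below is an illustration only: the chain $P = [[1-p, p], [q, 1-q]]$, the eigenvector normalization $v^2 = (p, -q)^{\mathsf{T}}$, and the function names are my assumptions. It compares the formula of the proposition with the sum of autocovariances (6.40b), which for this chain is a geometric series in $\lambda_2 = 1 - p - q$.

```python
# Sketch: check Prop. 6.6 on a two-state chain with
#   P = [[1-p, p], [q, 1-q]],  0 < p, q < 1.
# The numbers p, q, c below are arbitrary assumptions.

def asymptotic_variance_prop66(p, q, c):
    """sigma_CLT^2 = u^T Sigma_Delta u with Sigma_Delta = Pi - P^T Pi P."""
    P = [[1 - p, p], [q, 1 - q]]
    pi = [q / (p + q), p / (p + q)]
    lam2 = 1 - p - q                      # second eigenvalue
    v2 = [p, -q]                          # right eigenvector: P v2 = lam2 * v2
    beta = (c[0] - c[1]) / (p + q)        # c = eta + beta * v2
    u = [beta / (1 - lam2) * v for v in v2]
    sigma_delta = [[(pi[x] if x == y else 0.0)
                    - sum(pi[z] * P[z][x] * P[z][y] for z in range(2))
                    for y in range(2)] for x in range(2)]
    return sum(u[x] * sigma_delta[x][y] * u[y] for x in range(2) for y in range(2))

def asymptotic_variance_autocov(p, q, c):
    """sum_k R(k) as in (6.40b); here R(k) = lam2^|k| R(0)."""
    pi = [q / (p + q), p / (p + q)]
    lam2 = 1 - p - q
    R0 = pi[0] * pi[1] * (c[0] - c[1]) ** 2   # ordinary variance of c under pi
    return R0 * (1 + lam2) / (1 - lam2)

c = [0.0, 1.0]
for p, q in [(0.5, 0.5), (0.1, 0.1), (0.01, 0.01), (0.99, 0.99)]:
    s1 = asymptotic_variance_prop66(p, q, c)
    s2 = asymptotic_variance_autocov(p, q, c)
    print(f"p={p} q={q} lam2={1 - p - q:+.2f}  Prop 6.6: {s1:.4f}  (6.40b): {s2:.4f}")
```

The last two cases both have $|\lambda_2| = 0.98$, yet the asymptotic variance is large only when $\lambda_2$ is near $+1$.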


6.7.3 Sample complexity


What if a random confidence interval gives you no confidence? A deterministic confidence interval
requires what is known as a finite-$n$ bound, such as

$$
\mathsf{P}\{ |\eta_n - \eta| \ge r/\sqrt{n} \} \le B(n,r) \tag{6.42}
$$

with computable right hand side. It is more common to consider the probability of error exceeding
a fixed value $\varepsilon > 0$, and a bound of the form

$$
\mathsf{P}\{ |\eta_n - \eta| \ge \varepsilon \} \le \bar b \exp(-n I(\varepsilon)) \tag{6.43}
$$

where $\bar b$ is a finite constant, and $I(\varepsilon) > 0$ for $\varepsilon > 0$. This leads to sample complexity bounds for
estimating the mean; the terminology is explained in Section 8.1 in a more general setting.

In the majority of applications, you are out of luck. If you want (6.42) or (6.43) with explicit 
values for the right hand side, then you require substantial assumptions on the Markov model and 
the cost function. For example, bounds are available for a finite state space Markov chain, but the 
bounds are very loose unless there is substantial prior knowledge available. See the Notes section 
for background. 

The example that follows illustrates what can go wrong. 


6.7.4 A simple example? 


The M/M/1 queue introduced in Example 6.2.3 should be the best case scenario for efficient simulation.
This Markov chain enjoys the following properties whenever $\rho < 1$:


(i) The unique invariant pmf is geometric: $\pi(x) = (1-\rho)\rho^x$, $x = 0, 1, 2, \dots$.
(ii) It is skip-free: $|X(k+1) - X(k)| \le 1$ for all $k$.
(iii) It is geometrically ergodic: the limit in (6.6) holds geometrically fast, for all $x$, $S$.
(iv) It is reversible: this means that reversing the direction of time does not change the statistics
of $\boldsymbol{X}$ in steady state. Equivalently, the detailed balance equations hold:

$$
\pi(x) P(x,y) = \pi(y) P(y,x), \qquad x, y \in \mathsf{X} = \{0, 1, 2, \dots\}
$$

These desirable properties should have positive implications for simulation.


In fact, the M/M/1 queue and other skip-free Markov chains pose challenges, as illustrated here
for the simple task of estimating the steady state mean $\eta = \pi(c)$ with $c(x) = x$. We don’t need to
simulate, since we know $\eta$:

$$
\eta = (1-\rho) \sum_{n=0}^{\infty} n \rho^n = \frac{\rho}{1-\rho}
$$

The experiments that follow illustrate what can go wrong in even the simplest examples when the
state space is not finite.


Challenges: 
#1 The asymptotic variance is massive 
#2 The CLT holds, but the empirical distribution appears skewed for finite-n 


#3 There is no known sample complexity bound 


The ordinary variance is $\sigma^2 = \pi(\tilde c^2) = \pi(c^2) - \eta^2 = \rho/(1-\rho)^2$. The asymptotic variance is, as
stated above, massive:

$$
\sigma^2_{\text{CLT}} = (8 + o(1))\, \sigma^4
$$

where $o(1) = O(1-\rho)$. This approximation follows from [254, Prop. 11.3.1].


Figure 6.6: Histogram of estimates of the mean in the M/M/1 queue


Fig. 6.6 illustrates the CLT for this example with $\rho = 9/10$. The histogram shows results from
$2 \times 10^4$ independent runs, with initial condition drawn independently from $\pi$. The asymptotic
variance estimate $\sigma^2_{\text{CLT}} \approx 8(1-\rho)^{-4}$ results in the standard deviation approximation $\sigma_{\text{CLT}} \approx 283$
indicated in the figure.


Despite the skew observed in the histogram, the CLT remains a reliable predictor of algorithm
performance in this example. In particular, we find that the time horizon is too short. Consider
the approximation

$$
\mathsf{P}\{ |\eta_n - \eta| \ge 2\sigma_{\text{CLT}}/\sqrt{n} \} \approx \mathsf{P}\{ |W| \ge 2 \} \approx 0.05, \qquad W \sim N(0,1)
$$

With $n = 10^5$ we have $2\sigma_{\text{CLT}}/\sqrt{n} \approx 1.8 = 0.2\eta$. That is, in 5% of our experiments, the error is
over 20%. If we increase the time horizon to $n = 10^7$, then the error reduces to 2% with the same
frequency.
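
The arithmetic behind these run-length estimates is worth having at hand when planning an experiment. The sketch below only reproduces the calculation, taking the approximation $\sigma^2_{\text{CLT}} \approx 8(1-\rho)^{-4}$ at face value; the helper names `half_width` and `horizon_for` are hypothetical.

```python
import math

rho = 0.9
eta = rho / (1 - rho)                      # steady-state mean of the queue length
sigma_clt = math.sqrt(8 / (1 - rho) ** 4)  # approximation sigma_CLT^2 ~ 8 (1 - rho)^{-4}

def half_width(n, z=2.0):
    """CLT error level z * sigma_CLT / sqrt(n), exceeded roughly 5% of the time for z = 2."""
    return z * sigma_clt / math.sqrt(n)

def horizon_for(rel_error, z=2.0):
    """Run length n at which z * sigma_CLT / sqrt(n) equals rel_error * eta."""
    return (z * sigma_clt / (rel_error * eta)) ** 2

print(f"eta = {eta:.1f}, sigma_CLT = {sigma_clt:.0f}")
for n in (10**5, 10**7):
    w = half_width(n)
    print(f"n = {n:.0e}: half-width = {w:.2f} ({100 * w / eta:.0f}% of eta)")
print(f"run length for a 1% error level: {horizon_for(0.01):.2e}")
```

Because the required run length scales as $(1-\rho)^{-2}$ relative to $\sigma^2_{\text{CLT}}/\eta^2$, small changes in the load $\rho$ have a dramatic effect on the simulation budget.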

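A finite-$n$ bound of the form (6.43) is not available, but we can establish the following (highly
asymmetric) one sided asymptotic bounds: for functions $I_-, I_+\colon \mathbb{R}_+ \to \mathbb{R}_+$, and all $\varepsilon > 0$,

$$
\begin{aligned}
\lim_{n\to\infty} \frac{1}{n} \log \mathsf{P}\{ \eta_n - \eta \le -\varepsilon \} &= -I_-(\varepsilon) < 0 && \text{(6.44a)} \\
\lim_{n\to\infty} \frac{1}{n} \log \mathsf{P}\{ \eta_n - \eta \ge n\varepsilon \} &= -I_+(\varepsilon) < 0 && \text{(6.44b)}
\end{aligned}
$$

The first limit is what is expected for a finite state space Markov chain.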


[Two panels plotting X(n); legend entries: “Highest (theory)”, “Highest (observed)”, “Biggest area (theory)”, “Biggest area (observed)”.]

Figure 6.7: Sample paths of the M/M/1 queue: two large excursions


The probability in the second limit (6.44b) is not a typo: it is the probability that the error
exceeds $n$ times $\varepsilon$. See the Notes section for history and resources.

It isn’t difficult to explain the source of asymmetry: first observe that $\eta_n$ is non-negative for all
$n$, so that $\mathsf{P}\{ \eta_n - \eta \le -n\varepsilon \} = 0$ for all large $n$; this is consistent with (6.44a). To understand the
upper tail (6.44b), write $\eta_n = n^{-1} S_n$ with $S_n = \sum_{k=0}^{n-1} X(k)$, so that

$$
\mathsf{P}\{ \eta_n - \eta \ge n\varepsilon \} = \mathsf{P}\{ S_n \ge n\eta + n^2\varepsilon \}
$$

Examples of large excursions of the queue length process are illustrated in Fig. 6.7. In each of the
two sample paths shown, the maximal height of $\boldsymbol{X}$ is roughly proportional to the length of the
time horizon. The partial sum $S_n$ may be regarded as the area under the trajectory formed by
$\{X(k) : 0 \le k \le n-1\}$, which is of order $n^2$ over much of the run in these two plots. This is
consistent with (6.44b).

The performance criterion (6.43) is focused exclusively on rare events. That is, we know that
$\mathsf{P}\{ |\eta_n - \eta| \ge \varepsilon \}$ will go to zero very quickly. The asymptotic covariance is grounded in typical
performance, which helps to explain why the CLT holds under very general conditions.


6.7.5 Combating variance by design


Here are a few techniques to cope with high variance.

Common random numbers Suppose we have a family of Markov chains indexed by a parameter
$\theta \in \mathbb{R}^d$:

$$
X^\theta(k+1) = F(X^\theta(k), N(k+1); \theta) \tag{6.45}
$$

The goal is to estimate the parameter $\theta^*$ minimizing $\eta_\theta = \pi_\theta(c)$ for a function $c\colon \mathsf{X} \to \mathbb{R}$. If
using (6.37) to estimate $\eta_\theta$ for several different values of $\theta$, be sure to re-use the sample path
$\{N(k) : 1 \le k \le n\}$ for each $\theta$.


[Vertical axis: $(\theta^2 + 1)\eta_\theta$; horizontal axis: $\theta$; curves and histograms labeled “Common randomness” and “Independent randomness”.]

Figure 6.8: Common random numbers used to reduce relative variance.

Consider for example a scalar linear system

$$
X^\theta(k+1) = (1-\theta) X^\theta(k) + N(k+1)
$$

with an i.i.d., zero mean disturbance $\boldsymbol{N}$, $c(x) = x^2$, and our interest is minimizing over $\theta \in \mathbb{R}$ the
objective $\Gamma(\theta) = (\theta^2 + 1)\eta_\theta$. This is easily computed for this simple example:

$$
\Gamma(\theta) = (\theta^2 + 1)\, \frac{\sigma_N^2}{1 - (1-\theta)^2}
$$

The upper plots in Fig. 6.8 compare estimates obtained with common random numbers, and those
obtained with independent randomness (meaning that the samples of the disturbance $N(k)$ are
chosen independently for each value of $\theta$). The plot obtained using common random numbers is
smooth as a function of $\theta$.

Independent trials were performed, where in each of 500 experiments the minimizer was computed
for each approach. The histograms shown in Fig. 6.8 confirm that the estimate of $\theta^* \approx 0.6$ is far
more reliable when using common random numbers.
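
An experiment of this kind takes only a few lines. The following is a rough sketch and not the code used to produce Fig. 6.8: I assume quadratic cost $c(x) = x^2$, unit-variance Gaussian disturbance, a grid on $[0.4, 0.9]$, and a run length of 5000, and the function names are hypothetical. The "roughness" statistic is simply a convenient way to quantify the smoothness seen in the plots.

```python
import random

def estimate_objective(theta, noise):
    """Estimate Gamma(theta) = (theta^2 + 1) * eta_theta with c(x) = x^2, using (6.37)
    along the path X(k+1) = (1 - theta) X(k) + N(k+1) driven by the given noise."""
    x, total = 0.0, 0.0
    for w in noise:
        total += x * x
        x = (1.0 - theta) * x + w
    return (theta * theta + 1.0) * total / len(noise)

def roughness(values):
    """Sum of squared second differences: small for a smooth curve."""
    return sum((values[i - 1] - 2 * values[i] + values[i + 1]) ** 2
               for i in range(1, len(values) - 1))

rng = random.Random(1)
n = 5_000
thetas = [0.40 + 0.05 * i for i in range(11)]          # grid on [0.4, 0.9]

common = [rng.gauss(0.0, 1.0) for _ in range(n)]        # one disturbance path, re-used
gamma_crn = [estimate_objective(th, common) for th in thetas]
gamma_ind = [estimate_objective(th, [rng.gauss(0.0, 1.0) for _ in range(n)])
             for th in thetas]                          # fresh path for every theta

theta_crn = thetas[gamma_crn.index(min(gamma_crn))]
theta_ind = thetas[gamma_ind.index(min(gamma_ind))]
print("minimizer (common random numbers):", round(theta_crn, 2))
print("minimizer (independent noise):    ", round(theta_ind, 2))
print("roughness:", roughness(gamma_crn), "vs", roughness(gamma_ind))
```

Each individual estimate has the same variance under either scheme. What common random numbers buy is strong positive correlation between the errors at neighboring values of $\theta$, so that differences, and hence the location of the minimum, are estimated far more accurately.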


Split sampling Consider estimation of an expectation $\mathsf{E}[L(X(k), X(k+1))]$ where $\boldsymbol{X}$ is a finite
state space Markov chain with unique invariant pmf $\pi$. The standard estimator is

$$
L_N = \frac{1}{N} \sum_{k=0}^{N-1} L(X(k), X(k+1)) \tag{6.46}
$$

If the second eigenvalue of the transition matrix $P$ is close to unity, then the mixing time of the
Markov chain $\boldsymbol{X}$ will be large, and this will adversely affect the convergence rate. In this case the
following variant can be used, known as split sampling. Let $\boldsymbol{X}^1$ denote an i.i.d. sequence with
marginal $\pi$. Construct a second stochastic process as follows: For each $k = 1, 2, \dots$ the random
variable $X^2(k)$ is chosen in two stages. First, the value $x = X^1(k-1)$ is observed. Next, the value
of $X^2(k)$ is chosen according to the distribution $P(x, \,\cdot\,)$, independent of $\{X^1(j), X^2(j) : j < k-1\}$.
Based on this pair of stochastic processes, the estimate at iteration $N$ is defined by

$$
L_N^{s} = \frac{1}{N} \sum_{k=0}^{N-1} L(X^1(k), X^2(k+1)) \tag{6.47}
$$

See Exercise 6.15 for instructions on how to obtain an expression for the asymptotic variance of this
estimate, and how to compare it to the standard estimator. You can expect significant variance
reduction when $\sigma^2_{\text{CLT}} \gg \sigma^2$.
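
Here is a minimal sketch of the two estimators side by side. Everything concrete in it is an assumption made for illustration: a two state "sticky" chain whose second eigenvalue is 0.98, the function $L(x,y) = x + y$, and the helper names. Note that split sampling requires the ability to sample from $\pi$, which is taken as known here.

```python
import random

# Hypothetical two-state "sticky" chain; P, pi and L are assumptions for illustration.
P = {0: [0.99, 0.01], 1: [0.01, 0.99]}   # P[x] = distribution of the next state
pi = [0.5, 0.5]                           # invariant pmf of P

def L(x, y):
    return x + y

def step(x, rng):
    """Draw the next state from P(x, .)."""
    return 0 if rng.random() < P[x][0] else 1

def standard_estimate(N, rng):
    """(6.46): average of L along a single sample path, X(0) drawn from pi."""
    x = 0 if rng.random() < pi[0] else 1
    total = 0.0
    for _ in range(N):
        y = step(x, rng)
        total += L(x, y)
        x = y
    return total / N

def split_estimate(N, rng):
    """(6.47): X1(k) i.i.d. from pi, then X2(k+1) drawn from P(X1(k), .)."""
    total = 0.0
    for _ in range(N):
        x1 = 0 if rng.random() < pi[0] else 1
        total += L(x1, step(x1, rng))
    return total / N

exact = sum(pi[x] * P[x][y] * L(x, y) for x in (0, 1) for y in (0, 1))
rng = random.Random(0)
N = 20_000
print("exact   :", exact)
print("standard:", standard_estimate(N, rng))
print("split   :", split_estimate(N, rng))
```

Repeating the experiment with different seeds shows the spread of the standard estimate to be roughly ten times that of the split estimate for this chain.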
Control variates Suppose that there is a $d$-dimensional stochastic process $\{\Delta(k)\}$ that is correlated
with the sequence $\{c(X(k))\}$, and for which we know the steady-state mean is zero. That is,
the following limit holds with probability one, for each $1 \le i \le d$:

$$
\lim_{n\to\infty} \frac{1}{n} \sum_{k=0}^{n-1} \Delta_i(k) = 0
$$

For each $v \in \mathbb{R}^d$ denote $\Delta_v(k) = \sum_i v_i \Delta_i(k)$, and define the new sequence of estimates

$$
\eta_n^v = \frac{1}{n} \sum_{k=0}^{n-1} [c(X(k)) + \Delta_v(k)] \tag{6.48}
$$

Significant variance reduction is possible in some cases; see the Notes section for examples and
resources.
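
One way to build a control variate, offered here as an illustration rather than as the construction intended above, is to take $\Delta(k) = PV(X(k)) - V(X(k))$ for a function $V$ with $\pi(V)$ finite: invariance of $\pi$ gives $\pi(PV - V) = 0$. For the scalar linear model (6.28) with $c(x) = \tfrac12 x^2$ and $V$ the solution to Poisson's equation from (6.30), the identity $c + PV - V = \bar\eta$ means the new estimate has no variance at all. The sketch below assumes $d = 1$, $v = 1$, and arbitrary parameter values.

```python
import random

def simulate(a, sigma, n, kappa, seed=0):
    """Return the plain estimate (6.37) and the control-variate estimate (6.48) with d = 1, v = 1,
    for c(x) = x^2/2 and Delta(k) = PV(X(k)) - V(X(k)), V(x) = kappa x^2 / 2."""
    rng = random.Random(seed)
    x, plain, controlled = 0.0, 0.0, 0.0
    for _ in range(n):
        c = 0.5 * x * x
        V = 0.5 * kappa * x * x
        PV = 0.5 * kappa * (a * a * x * x + sigma * sigma)   # from (6.29)
        plain += c
        controlled += c + (PV - V)
        x = a * x + rng.gauss(0.0, sigma)
    return plain / n, controlled / n

a, sigma, n = 0.9, 1.0, 2000
eta_bar = 0.5 * sigma**2 / (1 - a * a)
plain, controlled = simulate(a, sigma, n, kappa=1.0 / (1 - a * a))
print("exact average cost:", eta_bar)
print("plain estimate    :", plain)
print("with control variate:", controlled)
```

Of course, if we knew the solution to Poisson's equation we would not need to simulate. In practice $V$ is only an approximation (try a different value of `kappa`), and the variance reduction is partial.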


State weighting and likelihood ratios Consider the mean-square error (6.22) defined with
respect to the invariant probability measure $\pi$. Algorithms to minimize this loss function may
suffer from very high variance. One way to reduce variance is to introduce a weighting function
$w\colon \mathsf{X} \to \mathbb{R}_+$ (it might also be interpreted as an unnormalized likelihood ratio). The weighted
mean-square error is denoted

$$
\| h - \hat h \|^2_{\pi,w} = \mathsf{E}\bigl[ [h(X(k)) - \hat h(X(k))]^2\, w(X(k)) \bigr] \quad \text{when } X(k) \sim \pi \tag{6.49}
$$

For example, in the M/M/1 example for which $\pi$ is geometric, and $h$ is a quadratic function of $x$,
we might choose $w(x) = 1/(1 + x^4)$, so that the product within the expectation is bounded.


6.8 Sensitivity and Actor-Only Methods 


Let’s turn now to a control problem. Rather than the feedback formulations developed in Part I,
we are presented with a parameterized family $\{P_\theta, \pi_\theta, c_\theta, \eta_\theta : \theta \in \mathbb{R}^d\}$, subject to the following
assumptions:


Assumptions for a Markov family. There is a common state space $\mathsf{X}$, and for each $\theta$:
▲ $\pi_\theta$ is invariant for $P_\theta$
▲ $c_\theta\colon \mathsf{X} \to \mathbb{R}$, and $\eta_\theta = \pi_\theta(c_\theta)$
▲ There is a solution $h_\theta$ to Poisson’s equation:

$$
c_\theta + P_\theta h_\theta = h_\theta + \eta_\theta \tag{6.50}
$$

▲ The average cost $\eta_\theta$ and other functions of $\theta$ are continuously differentiable in $\theta$.


The assumption that $\eta_\theta$ and $h_\theta$ are continuously differentiable functions of $\theta$ is not strong. See
Exercise 6.21 for hints at how to obtain a representation of the gradient of $h_\theta$, subject to mild
assumptions on the parametrized model. The Notes section contains references with full details.

The control objective is to minimize the loss function $\Gamma(\theta) = \eta_\theta$ over all $\theta \in \mathbb{R}^d$ and, as in
Chapter 4, our approach is to approximate gradient descent: $\tfrac{d}{dt}\vartheta_t = -\nabla\Gamma(\vartheta_t)$.

A formula from the 1960s provides one avenue for approximating the gradient, based on the
so-called score function, defined for a finite state space model as follows:

$$
S^\theta(x,x') = \nabla_\theta \log(P_\theta(x,x')), \qquad x, x' \in \mathsf{X} \tag{6.51}
$$

Lemma 6.7 is merely a restatement of this definition, recalling the chain rule for the gradient of a
logarithm:


Lemma 6.7. For a finite state space Markov chain, and any function $g\colon \mathsf{X} \to \mathbb{R}$,

$$
\nabla_\theta \Bigl\{ \sum_{x'} P_\theta(x,x') g(x') \Bigr\} = \sum_{x'} P_\theta(x,x') S^\theta(x,x') g(x')
$$

□


The proof of the following representation is found at the end of this section.


Theorem 6.8. (Sensitivity Theorem) Suppose that the assumptions of this section hold, and
in addition $\mathsf{X}$ is finite, $c_\theta(x)$ is continuously differentiable in $\theta$ for each $x$, and the score function
is continuous at each value of $\theta$ for which $P_\theta(x,x') > 0$. Then,

$$
\nabla\Gamma(\theta) = \mathsf{E}_{\pi_\theta} \bigl[ \nabla_\theta c_\theta(X(k)) + S^\theta(X(k), X(k+1))\, h_\theta(X(k+1)) \bigr] \tag{6.52}
$$

where the expectation is in steady-state.


The stochastic approximation theory of Chapter 8 invites the stochastic gradient descent (SGD)
algorithm

$$
\begin{aligned}
\theta_{n+1} &= \theta_n - \alpha_{n+1} \breve\nabla_{\!\Gamma}(n+1) \\
\breve\nabla_{\!\Gamma}(n+1) &\overset{\text{def}}{=} \bigl[ \nabla_\theta c_\theta(X(n)) + S^\theta(X(n), X(n+1))\, h_\theta(X(n+1)) \bigr] \Big|_{\theta = \theta_n}
\end{aligned}
$$

where $\{\alpha_n\}$ is a step-size sequence, as seen in many earlier chapters. In practice, it is likely that
we have designed $P_\theta$ and $c_\theta$, so the score function is available. A challenge with this algorithm is
the computational complexity associated with computing $h_{\theta_n}$ at each iteration $n$.

The update equation for $\theta$ is an example of the actor in the RL literature, and the relative
value function $h_{\theta_n}$ is the critic at iteration $n$. The actor-critic methods surveyed in Chapter 10
provide a computationally feasible refinement of eq. (6.53), in which the relative value function is
estimated simultaneously with estimation of the parameter $\theta^*$ that minimizes $\Gamma$.
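
Before investing in an algorithm, it is reassuring to confirm the formula (6.52) on a model small enough to solve by hand. The sketch below uses a hypothetical two state family in which only the first row of $P_\theta$ depends on $\theta$ (through a sigmoid), with a cost that does not depend on $\theta$; these choices and the function names are mine. The right hand side of (6.52) is compared with a central finite difference of $\eta_\theta$.

```python
import math

# Hypothetical two-state family: P_theta = [[1 - p, p], [q, 1 - q]] with p = sigmoid(theta);
# q and the cost c are fixed (so grad c_theta = 0). All values are assumptions.
q, c = 0.3, [1.0, 4.0]

def model(theta):
    p = 1.0 / (1.0 + math.exp(-theta))
    P = [[1 - p, p], [q, 1 - q]]
    dP = [[-p * (1 - p), p * (1 - p)], [0.0, 0.0]]   # derivative of P with respect to theta
    return P, dP

def average_cost(theta):
    P, _ = model(theta)
    p = P[0][1]
    pi = [q / (p + q), p / (p + q)]                  # invariant pmf
    return pi, pi[0] * c[0] + pi[1] * c[1]

def gradient_via_score(theta):
    """Right-hand side of (6.52) for this family."""
    P, dP = model(theta)
    pi, eta = average_cost(theta)
    p = P[0][1]
    h = [0.0, (eta - c[0]) / p]                      # solves c + P h = h + eta, with h(0) = 0
    grad = 0.0
    for x in range(2):
        for y in range(2):
            score = dP[x][y] / P[x][y]               # S(x, y) = d/dtheta log P_theta(x, y)
            grad += pi[x] * P[x][y] * score * h[y]
    return grad

theta, eps = 0.4, 1e-6
fd = (average_cost(theta + eps)[1] - average_cost(theta - eps)[1]) / (2 * eps)
print("score-function formula:", gradient_via_score(theta))
print("finite difference     :", fd)
```

In an SGD implementation the double sum over $(x, y)$ is replaced by a single observed transition $(X(n), X(n+1))$, which is where the variance issues discussed in Section 6.7 re-enter.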

How is the score function defined for a general state space? Suppose that $\mathsf{X} = \mathbb{R}^n$ and there is
a transition density:

$$
P_\theta(x,A) = \int_{y \in A} p_\theta(x,y)\, dy, \qquad x \in \mathsf{X},\ A \in \mathcal{B}(\mathsf{X}) \tag{6.53}
$$

In this case the definition is analogous to the finite state space case

$$
S^\theta(x,x') = \nabla_\theta \log(p_\theta(x,x')), \qquad x, x' \in \mathsf{X} \tag{6.54}
$$

Lemma 6.7 continues to hold, subject to smoothness assumptions on the density, and growth
conditions on $g$.

Consider for example the nonlinear model with additive noise, $X(k+1) = F(X(k);\theta) + N(k+1)$,
where the marginal of $\boldsymbol{N}$ has density $p_N$. For $g\colon \mathsf{X} \to \mathbb{R}$ we have

$$
P_\theta g\,(x) = \int g(F(x;\theta) + z)\, p_N(z)\, dz
$$

Consequently, it would appear that $\nabla P_\theta g$ requires differentiation of $g$. This is avoided through the
change of variables $y = F(x;\theta) + z$:

$$
P_\theta g\,(x) = \int g(y)\, p_\theta(x,y)\, dy, \qquad p_\theta(x,y) = p_N(y - F(x;\theta))
$$

and from this we obtain a version of Lemma 6.7:

$$
\begin{aligned}
\nabla_\theta P_\theta g\,(x) &= \mathsf{E}[ g(X(k+1))\, S^\theta(X(k), X(k+1)) \mid X(k) = x ] \\
S^\theta(x,x') &= \nabla_\theta \log\{ p_N(x' - F(x;\theta)) \}, \qquad x, x' \in \mathsf{X}
\end{aligned}
\tag{6.55}
$$
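
For a concrete instance of (6.55), suppose (purely as an illustration) that the state is scalar and the disturbance is Gaussian, $N(k) \sim N(0, \sigma_N^2)$. The score function is then explicit:

```latex
\begin{aligned}
\log p_N(z) &= -\frac{z^2}{2\sigma_N^2} - \tfrac12 \log(2\pi\sigma_N^2) \\
S^\theta(x,x') &= \nabla_\theta \log p_N\bigl(x' - F(x;\theta)\bigr)
  = \frac{x' - F(x;\theta)}{\sigma_N^2}\, \nabla_\theta F(x;\theta) \\
S^\theta\bigl(X(k), X(k+1)\bigr) &= \frac{N(k+1)}{\sigma_N^2}\, \nabla_\theta F(X(k);\theta)
\end{aligned}
```

In the linear special case $F(x;\theta) = (1-\theta)x$ this reduces to $S^\theta(X(k), X(k+1)) = -X(k) N(k+1)/\sigma_N^2$. Along a sample path the score is a martingale difference sequence, and its magnitude is inversely proportional to the noise level.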


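Proof of Thm. 6.8. Expanding Poisson’s equation gives, for each $x$ and $\theta$,

$$
c_\theta(x) + \sum_{x'} P_\theta(x,x') h_\theta(x') = h_\theta(x) + \Gamma(\theta)
$$

We then take the gradient of each side, applying the product rule:

$$
\nabla c_\theta(x) + \sum_{x'} \bigl\{ [\nabla P_\theta(x,x')] h_\theta(x') + P_\theta(x,x') \nabla h_\theta(x') \bigr\} = \nabla h_\theta(x) + \nabla\Gamma(\theta)
$$

The introduction of the score function is only so that we can bring $P_\theta$ outside of the brackets on
the left hand side (essentially an application of Lemma 6.7) to obtain

$$
\nabla c_\theta(x) + \sum_{x'} P_\theta(x,x') \bigl\{ S^\theta(x,x') h_\theta(x') + \nabla h_\theta(x') \bigr\} = \nabla h_\theta(x) + \nabla\Gamma(\theta)
$$

Multiplying each side by $\pi_\theta(x)$ and summing then gives

$$
\mathsf{E}_{\pi_\theta} \bigl[ \nabla c_\theta(X(k)) + S^\theta(X(k), X(k+1)) h_\theta(X(k+1)) + \nabla h_\theta(X(k+1)) \bigr] = \mathsf{E}_{\pi_\theta} [\nabla h_\theta(X(k))] + \nabla\Gamma(\theta)
$$

The proof is completed on canceling $\mathsf{E}_{\pi_\theta}[\nabla h_\theta(X(k+1))] = \mathsf{E}_{\pi_\theta}[\nabla h_\theta(X(k))]$ from each side. □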


6.9 Ergodic Theory for General Markov Chains* 


Spectral theory for Markov chains can be extended far beyond the finite state-space setting. In 
particular, a full generalization of Thm. 6.2 is available, known as the V-uniform ergodic theorem 
[257]. Ergodic theory and bounds on solutions to Poisson’s equation are important in the remainder 
of the book, but you will have to go to other sources for details: a survey with references can be 
found in the Appendix of [254], and a high level survey is provided here.


6.9.1 Classification


For a Markov chain on a general state space, it remains true that $\lambda_1 = 1$ is always a right eigenvalue.
Recalling the notation in (6.9b), we have $P1 = 1$, where “1” denotes a function: $1(x) = 1$ for all
$x \in \mathsf{X}$. More generally, for each $k \ge 1$ and $x \in \mathsf{X}$,

$$
P^k 1\,(x) = \mathsf{E}[1(X(r+k)) \mid X(r) = x] = 1
$$


The assumptions on eigenvalues appearing in Thm. 6.2 are traditionally replaced with the following
notion of irreducibility and aperiodicity. A probability measure $\psi$ on $\mathsf{X}$ is identified, and we then
have the following classification:


Classification of Markov Chains.
(i) The chain is $\psi$-irreducible if, for any set $A \in \mathcal{B}(\mathsf{X})$ satisfying $\psi(A) > 0$, and any $x \in \mathsf{X}$,
there is $n \ge 1$ satisfying $P^n(x,A) > 0$.


(ii) The chain is aperiodic if, for any set $A \in \mathcal{B}(\mathsf{X})$ satisfying $\psi(A) > 0$, and any $x \in \mathsf{X}$, there
is $n_0 \ge 1$ satisfying

$$
P^n(x,A) > 0, \qquad n \ge n_0
$$

These definitions appear un-verifiable: how can one test every $A \in \mathcal{B}(\mathsf{X})$? One approach is
through verification of a minorization condition: A set $S \in \mathcal{B}(\mathsf{X})$ is called small if there is a
probability measure $\nu$, $\delta > 0$, and a time $n$ such that

$$
P^n(x,A) \ge \delta \nu(A), \qquad x \in S,\ A \in \mathcal{B}(\mathsf{X}) \tag{6.56}
$$

This isn’t as hard to verify as it might appear. For example, it is easy to construct small sets if
$\mathsf{X} = \mathbb{R}^n$ and $P^n$ has a continuous density.
Lemma 6.9. Suppose that there is a pair $(S,\nu)$ satisfying (6.56). Then,


(i) Suppose that $\sum_{k=0}^{\infty} P^k(x_0,S) > 0$ for every $x_0 \in \mathsf{X}$. Then the chain is $\psi$-irreducible, with
$\psi = \nu$.
(ii) Suppose that for any $x \in \mathsf{X}$, there is $n_0 \ge 1$ satisfying

$$
P^n(x,S) > 0, \qquad n \ge n_0
$$

Then the chain is also aperiodic. □


The situation is even simpler when $\mathsf{X}$ is countable, since every singleton is small (a set $S$
consisting of just one element). Suppose that there is $x^* \in \mathsf{X}$ satisfying

$$
\sum_{k=0}^{\infty} P^k(x_0, x^*) > 0 \qquad \text{for every } x_0 \in \mathsf{X}.
$$

Applying Lemma 6.9, using $S = \{x^*\}$, we see that the chain is $\nu$-irreducible, with $\nu(\,\cdot\,) = P(x^*, \,\cdot\,)$.
In this case, the term uni-chain is substituted for $\psi$-irreducible (or we say the chain is $x^*$-irreducible
if we want to emphasize the particular reachable state $x^*$). The definition of aperiodicity is also
more easily verified: there is $n_0 \ge 1$ satisfying

$$
P^n(x^*, x^*) > 0, \qquad n \ge n_0 \tag{6.57}
$$

One celebrated representation of an invariant pmf $\pi$ is known as Kac’s Theorem:


Proposition 6.10. (Kac’s Theorem) Suppose that $\boldsymbol{X}$ is a Markov chain on a countable state
space. Suppose that $x^* \in \mathsf{X}$ satisfies $\mathsf{E}_{x^*}[\tau_{x^*}] < \infty$. Then there is an invariant pmf satisfying
$\pi(x^*) > 0$, and for all functions $g$ for which $\pi(|g|)$ is finite,

$$
\pi(g) \overset{\text{def}}{=} \sum_{x} \pi(x) g(x) = \pi(x^*)\, \mathsf{E}_{x^*} \Bigl[ \sum_{k=0}^{\tau_{x^*} - 1} g(X(k)) \Bigr] \tag{6.58}
$$

The invariant pmf is unique if $\boldsymbol{X}$ is uni-chain. □
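
Taking $g \equiv 1$ in (6.58) gives the familiar identity $\pi(x^*) = 1/\mathsf{E}_{x^*}[\tau_{x^*}]$, which is simple to test by simulation. The following sketch assumes a two state chain with arbitrary transition probabilities $p$ and $q$, and that $\tau_{x^*}$ denotes the first return time to $x^*$; the function name is hypothetical.

```python
import random

p, q = 0.3, 0.2            # hypothetical chain P = [[1-p, p], [q, 1-q]] on {0, 1}
pi0 = q / (p + q)          # invariant probability of the state x* = 0

def return_time(rng):
    """First return time to x* = 0 for a path started at X(0) = 0."""
    x, t = 0, 0
    while True:
        u = rng.random()
        x = (1 if u < p else 0) if x == 0 else (0 if u < q else 1)
        t += 1
        if x == 0:
            return t

rng = random.Random(0)
m = 20_000
mean_tau = sum(return_time(rng) for _ in range(m)) / m
print("estimated E[tau]:", mean_tau, "  1/pi(x*):", 1 / pi0)
```

Successive return times are i.i.d., so this estimate obeys the ordinary CLT regardless of how slowly the chain mixes.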


The connection with the finite state space setting is made clear in the following: 


Proposition 6.11. The following equivalences hold for a finite state space Markov chain:
(i) The eigenvalue $\lambda_1 = 1$ is not repeated $\iff$ the uni-chain assumption holds.


(ii) The eigenvalue $\lambda_1 = 1$ is the only eigenvalue satisfying $|\lambda| = 1$, and this eigenvalue is not
repeated $\iff$ the chain is aperiodic. □


6.9.2 Lyapunov theory 


Recall that the coercive assumption was a useful property for Lyapunov functions in a deterministic
setting: $V$ is coercive means that the sublevel set $S_V(r)$ is a bounded subset of $\mathsf{X}$ for any $r$. In the
theory of $\psi$-irreducible Markov chains, the sublevel sets are small, either by direct assumption, or
as a consequence of other assumptions. The two Lyapunov conditions (V3) and (V4) that follow are
examples. Each of these bounds implies a “negative drift” whenever the state is outside of a small
set:

$$
\mathsf{E}[V(X(k+1)) \mid X(0), \dots, X(k)] < V(X(k)) \qquad \text{whenever } X(k) \in S^c
$$

Assumption (V3) below is a refinement of the Poisson inequality (6.27).


Theorem 6.12. (Lyapunov condition for Poisson’s equation) Suppose that the Markov
chain is $\psi$-irreducible, and there exists a solution to the following Lyapunov bound: For a non-negative
valued function $V$ on $\mathsf{X}$, a small set $S \in \mathcal{B}(\mathsf{X})$, $b < \infty$, and a function $f\colon \mathsf{X} \to [1,\infty)$,

$$
PV(x) \le V(x) - f(x) + b \mathbb{1}_S(x), \qquad x \in \mathsf{X}. \tag{V3}
$$

Then, there exists a unique invariant probability measure $\pi$. It satisfies $\pi(f) < \infty$, and the following
additional conclusions hold for any function $c\colon \mathsf{X} \to \mathbb{R}$ satisfying
(i) There exists a solution to Poisson’s equation (6.23) with $\eta = \pi(c)$ and

$$
\sup_{x} \frac{|h(x)|}{V(x) + 1} < \infty
$$

(ii) Suppose that the sublevel set $S_c(r)$ is small or empty for each $r$, where

$$
S_c(r) = \{ x \in \mathsf{X} : c(x) \le r \}
$$

Then the function $h$ in (i) can be chosen so that $h(x) \ge 0$ for each $x$. □

There are also ergodic theorems available under (V3); for details see [257].


In most cases a far stronger drift condition is available, and from this we obtain a strong
form of geometric ergodicity. The interpretation of a Lyapunov function as a weighting function is
particularly useful in the definitions that follow. For this we assume that $V(x) \ge 1$ for each $x$, and
denote for $c\colon \mathsf{X} \to \mathbb{R}$,

$$\|c\|_V = \sup_{x \in \mathsf{X}} \frac{|c(x)|}{V(x)}$$

Let $L_\infty^V$ denote the set of all Borel-measurable functions for which this is finite. We say that the
Markov chain is $V$-uniformly ergodic if there exists an invariant measure $\pi$, along with constants
$\rho < 1$, $B < \infty$ such that for any $c \in L_\infty^V$,

$$|\mathsf{E}_x[c(X(k))] - \pi(c)| \le B\rho^k \|c\|_V V(x), \qquad k \ge 0,\ x \in \mathsf{X} \qquad (6.59)$$

Or, in operator-theoretic notation, on denoting $\tilde c = c - \pi(c)$,

$$\|P^k \tilde c\|_V \le B\rho^k \|c\|_V, \qquad k \ge 0$$
Theorem 6.13. (Lyapunov condition for V-uniform ergodicity) Suppose that the Markov
chain is $\psi$-irreducible and aperiodic, and the following drift condition holds for a function $V\colon \mathsf{X} \to [1,\infty)$:
for constants $\varepsilon > 0$, $b < \infty$, and a small set $S \in \mathcal{B}(\mathsf{X})$,

$$PV(x) \le (1-\varepsilon)V(x) + b\,\mathbf{1}_S(x), \qquad x \in \mathsf{X}. \qquad \text{(V4)}$$

Then the Markov chain is $V$-uniformly ergodic. $\square$
The fact that we can identify the state-dependency in the bound (6.59) is fantastic news. 
Unfortunately, obtaining bounds on B or p is not at all easy [256, 233, 302, 303]. 
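
The drift condition (V4) can be checked numerically in simple cases. The following is a minimal sketch, not taken from the text: it assumes a reflected random walk on the non-negative integers with hypothetical up and down probabilities `alpha` and `mu`, a candidate Lyapunov function `V(x) = r**x`, and the candidate small set `S = {0}`. The names `PV`, `eps`, `b`, and `drift_holds` are illustrative choices.

```python
import math

# Hypothetical reflected random walk on {0, 1, 2, ...}:
# up with probability alpha, down with probability mu (reflection at 0).
alpha, mu = 0.3, 0.7
r = math.sqrt(mu / alpha)      # base of the candidate Lyapunov function


def V(x):
    """Candidate Lyapunov function; V(x) >= 1 for every x >= 0."""
    return r ** x


def PV(x):
    """One-step conditional expectation E[V(X(k+1)) | X(k) = x]."""
    down = V(x - 1) if x >= 1 else V(0)
    return alpha * V(x + 1) + mu * down


# For x >= 1 we have PV(x) = (alpha*r + mu/r) V(x), which fixes eps.
eps = 1.0 - (alpha * r + mu / r)
# Choose b so that the bound also holds on the candidate small set S = {0}.
b = PV(0) - (1.0 - eps) * V(0)


def drift_holds(x, tol=1e-9):
    """Check PV(x) <= (1 - eps) V(x) + b 1_S(x), up to rounding."""
    rhs = (1.0 - eps) * V(x) + (b if x == 0 else 0.0)
    return PV(x) <= rhs * (1.0 + tol)


all_ok = all(drift_holds(x) for x in range(200))
print(f"eps = {eps:.4f}, b = {b:.4f}, (V4) holds on 0..199: {all_ok}")
```

The check only confirms the inequality on a finite range of states; the conclusion of V-uniform ergodicity still rests on the theorem and on the assumption that `S` is small.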


6.10 Exercises 


There are many exercises in this chapter: it is very important to understand this material and 
also the notation in order to follow the final chapters of this book. 


6.1 Consider the two-state Markov chain on $\mathsf{X} = \{0,1\}$ with transition matrix
(a) First, think probabilistically: Explain why the first hitting time to $x = 1$ from the starting
point $X(0) = 0$ has a geometric distribution on $\{1,2,\dots\}$. Compute the invariant pmf $\pi$ using
Prop. 6.10 (Kac’s Theorem).


The rest is algebra: 
(b) Obtain the spectral representation of $P$,

$$P = \lambda_1 v^1 \mu^1 + \lambda_2 v^2 \mu^2$$

where $\{\lambda_i\}$ are eigenvalues, $\{\mu^i\}$ are left eigenvectors (taken to be row vectors), and $\{v^i\}$ are right
eigenvectors (taken to be column vectors).

(c) Find an expression for $P^n$ based on (b). At what rate does $P^n(i,j)$ tend to $\pi(j)$?

(d) Choose a pair of two dimensional vectors $w$ and $v$. The product $w v^\intercal$ is then a $2 \times 2$ matrix
(often written $w \otimes v$ in this book for emphasis). Compute the inverse,

$$G = [I - (P - w v^\intercal)]^{-1}$$

and verify that the row vector $v^\intercal G$ is proportional to $\pi$. If you are unlucky, the inverse will
not exist; if so, try a different pair!

This is an example of the Perron-Frobenius construction from Section 3.2.3, for which there is 
theory available to ensure invertibility. See Exercise 6.20 for another example. 


6.2 Let $X = (X(0), X(1), X(2), \dots)$ denote a Markov chain on the state space of three elements
$\mathsf{X} = \{1,2,3\}$. Its transition matrix is of the following form:

$$P = \begin{bmatrix} 1-\varepsilon_2 & \varepsilon_2 & 0 \\ \varepsilon_1 & 1-2\varepsilon_1 & \varepsilon_1 \\ 0 & \varepsilon_2 & 1-\varepsilon_2 \end{bmatrix} \qquad \text{(all elements are non-negative)}$$

(a) Verify that $\nu = (\varepsilon_1, \varepsilon_2, \varepsilon_1)^\intercal$ is a left eigenvector:

$$\nu^\intercal P = \nu^\intercal$$

and from this obtain $\pi$.

(b) Find all eigenvalues of $P$ (perform your calculation by hand).

(c) Verify that $P^k(i,j)$ converges to a limit $\pi(j)$, as $k \to \infty$, for any $i, j$.
(d) Fix numerical values for $P$; choose small, distinct values for $\varepsilon_1, \varepsilon_2$.

Plot $\log(\mathsf{P}\{X(k) = 1 \mid X(0) = 1\})$, for $0 \le k \le 100$. On the same figure, plot $\log(|\lambda_2|^k)$ where $\lambda_2$
is the “second eigenvalue” of $P$ (the one with second largest magnitude).
Discuss your findings.
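
A possible starting point for the numerical part of Exercise 6.2 is sketched below. It is only an illustration: the values of `eps1` and `eps2` are arbitrary, the helper names `matmul` and `transient_errors` are hypothetical, and the script estimates the geometric rate of convergence from successive errors rather than producing the requested plot.

```python
eps1, eps2 = 0.05, 0.02      # illustrative values only

# Transition matrix of Exercise 6.2 (states 1, 2, 3 stored at indices 0, 1, 2).
P = [[1 - eps2, eps2, 0.0],
     [eps1, 1 - 2 * eps1, eps1],
     [0.0, eps2, 1 - eps2]]


def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Invariant pmf obtained by normalizing the left eigenvector (eps1, eps2, eps1).
total = 2 * eps1 + eps2
pi = [eps1 / total, eps2 / total, eps1 / total]


def transient_errors(num_steps):
    """Return |P^k(1,1) - pi(1)| for k = 1, ..., num_steps."""
    errs, Pk = [], [row[:] for row in P]
    for _ in range(num_steps):
        errs.append(abs(Pk[0][0] - pi[0]))
        Pk = matmul(Pk, P)
    return errs


errs = transient_errors(300)
rate = errs[-1] / errs[-2]   # empirical geometric rate for large k
print(f"pi = {pi}, estimated rate = {rate:.6f}")
```

The printed rate can be compared with the magnitude of the second eigenvalue found by hand in part (b).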


6.3 Conditional expectation and preparation for Chapter 9. Suppose that $X, Y$ are scalar-valued
random variables with finite second moments. The conditional expectation $\hat X = \mathsf{E}[X \mid Y]$ is defined
to be the solution to a minimum norm problem: $\hat X = g^*(Y)$, with

$$g^* = \arg\min_g \Gamma(g) = \arg\min_g \mathsf{E}[(X - g(Y))^2]$$

Consider $X = 1/(1+Y)$, with $Y$ uniformly distributed on the interval $(0,1]$.
(a) Compute $\mathsf{E}[X \mid Y]$ by constructing $g^*$, resulting in $\Gamma(g^*) = 0$.
(b) In many applications it isn’t so easy, so we opt for an approximation. It is common to restrict
to a finite-dimensional set of functions. That is, for dimension $d$, assume that we have $d$ functions
$\{\psi_1,\dots,\psi_d\}$, denote $g_\theta = \sum_i \theta_i \psi_i$ for $\theta \in \mathbb{R}^d$, and define $\hat X = g_{\theta^*}(Y)$ with

$$\theta^* = \arg\min_\theta \Gamma(g_\theta)$$

Obtain an expression for $\theta^*$ based on expectations such as

$$R^\psi_{i,j} = \mathsf{E}[\psi_i(Y)\psi_j(Y)], \qquad \bar\psi^X_i = \mathsf{E}[\psi_i(Y) X]$$

(c) Discuss how you would compute the solution to (b) using simulation. Say, given i.i.d. samples
$\{Y_i\}$ each with uniform distribution on $[0,1]$.

(d) Compute or approximate $\theta^*$ for a basis of your choice (Fourier or polynomial are acceptable)
with $d = 1,2,3,4,5$. On the same figure show the five plots of $g_{\theta^*}(y)$ as a function of $0 \le y \le 1$,
and also include a plot of $g^*$. You should get a good approximation if your basis is reasonable, and
$d$ isn’t too small.
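
One way to approach parts (c) and (d) of Exercise 6.3 is sketched below, under stated assumptions: a polynomial basis $\psi_i(y) = y^{i-1}$, sample averages in place of the expectations, and a small Gaussian elimination routine to solve the resulting normal equations. The function names `basis`, `solve`, `fit_theta`, and `g_theta` are hypothetical, and this is not presented as the intended solution.

```python
import random


def basis(y, d):
    """Polynomial basis 1, y, ..., y**(d-1); an illustrative choice."""
    return [y ** i for i in range(d)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[piv] = M[piv], M[col]
        for row in range(col + 1, n):
            f = M[row][col] / M[col][col]
            for c in range(col, n + 1):
                M[row][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def fit_theta(d, num_samples=20000, seed=0):
    """Estimate theta* from i.i.d. samples of Y, with X = 1 / (1 + Y)."""
    rng = random.Random(seed)
    R = [[0.0] * d for _ in range(d)]      # sample average of psi psi^T
    psi_x = [0.0] * d                      # sample average of psi * X
    for _ in range(num_samples):
        y = rng.random()
        x = 1.0 / (1.0 + y)
        psi = basis(y, d)
        for i in range(d):
            psi_x[i] += psi[i] * x / num_samples
            for j in range(d):
                R[i][j] += psi[i] * psi[j] / num_samples
    return solve(R, psi_x)


def g_theta(theta, y):
    """Evaluate the approximation sum_i theta_i psi_i(y)."""
    return sum(t * p for t, p in zip(theta, basis(y, len(theta))))


theta = fit_theta(d=4)
max_err = max(abs(g_theta(theta, k / 100) - 1 / (1 + k / 100))
              for k in range(101))
print("theta =", theta, " max error on grid =", max_err)
```

Repeating the call to `fit_theta` for several values of `d` gives the family of approximations requested in part (d).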


6.4 Let $X$ be a time-homogeneous Markov chain on the finite state space $\mathsf{X} = \{1,\dots,N\}$, and let
$P$ denote the $N \times N$ transition matrix. For any function $h\colon \mathsf{X} \to \mathbb{R}$, review or recall the smoothing
property:

$$\mathsf{E}[h(X(k+2)) \mid X(k)] = \mathsf{E}\bigl[\mathsf{E}[h(X(k+2)) \mid X(k), X(k+1)] \mid X(k)\bigr]$$

Explain how this implies the representation for the 2-step transition matrix: for each $i \in \mathsf{X}$,

$$\mathsf{E}[h(X(k+2)) \mid X(k) = i] = \sum_j P^2(i,j)\,h(j),$$

where $P^2 = P \cdot P$ is the usual matrix product, and $P^2(i,j)$ is the $(i,j)$-th entry of $P^2$.

6.5 ODE method for Lyapunov function construction. Consider the nonlinear state space model

$$X(k+1) = X(k) + F(X(k)) + sN(k+1)$$

where $F$ is a scaling of the vector field in Exercise 2.13:

$$F(x) = -\delta\,\frac{1 - e^{-x}}{1 + e^{-x}} = -\delta\tanh(x/2), \qquad \text{with } \delta > 0.$$

The disturbance $\{N(k)\}$ is i.i.d., and uniform on $[-1,1]$. That is, its density is supported on this
interval, with $f_N(x) = 1/2$ for $-1 \le x \le 1$.

Take the function $V$ you obtained for the continuous time model in Exercise 2.13, and see if you
can establish a similar bound: $PV \le V - c + \bar\eta_s$, where $\bar\eta_s$ will probably grow as the cube of $s$ for
any fixed $\delta$. You may have to modify $V$ slightly, depending on your choice.

You will test the resulting bound $\pi(c) \le \bar\eta_s$ in Exercise 6.14.


6.6 Let $P$ be a transition kernel, $c\colon \mathsf{X} \to \mathbb{R}_+$ a cost function, $\gamma \in (0,1)$ the discount factor, and
recall the discounted cost value function was defined in (6.33). To begin, provide a proof that the
DP equation (6.33) holds based on the representation (6.32).
(a) In Example 6.6.1 we obtained $h_\gamma$ for the linear state space model. Was stability required to
obtain a solution? That is, do we require $|a| < 1$?

(b) Consider a finite state-space Markov chain that is uni-chain. Show that

$$h_\gamma(x) = \tilde h_\gamma(x) + k(\gamma), \qquad x \in \mathsf{X}$$

in which $\sup_{0<\gamma<1} |\tilde h_\gamma(x)|$ is finite for each $x$, and $k(\gamma)$ does not depend upon $x$. For this you will
find the following useful: $P^n(x,y) = \pi(y) + \tilde P^n(x,y)$ (see (6.19)).

(c) Compute $\tilde h_\gamma$ and $k(\gamma)$ for the four-state Markov chain with transition matrix (6.21), and with
$c(x) = x$.


6.7 We can extend the Poisson inequality to something more closely aligned with the discounted
setting. Suppose that $(P, c, \gamma)$ are as in Exercise 6.6. Suppose also that we have a solution $V\colon \mathsf{X} \to \mathbb{R}_+$
and $\bar\eta < \infty$ to the following “discounted” Poisson inequality:

$$\gamma PV \le V - c + \bar\eta$$

(a) Show that $h_\gamma \le V + \text{const.}$, and identify the constant (this is easy once you review the average
cost bound obtained using Poisson’s inequality).

(b) Obtain a solution to the inequality for the M/M/1 queue with $c(x) = x$.
Is the load condition $\rho < 1$ necessary to obtain a solution?
6.8 On Poisson’s equation. This problem shows that a solution to a dynamic programming equation
must be interpreted with care.
For the M/M/1 queue, with load $\rho = \alpha/\mu \in (0,1)$, perform the following computations:
(a) Let $V(x) = \rho^{-x}$, and compute the drift,

$$\Delta(x) = PV(x) - V(x), \qquad x = 0, 1, 2, \dots$$

(b) Compute the steady-state mean $m$ of $\Delta$. For this recall that the invariant pmf is $\pi(x) = (1-\rho)\rho^x$,
so that $m = \sum_x \pi(x)\Delta(x)$. Verify that the mean is not equal to zero. Also verify that
the mean of $V$ is not finite.

(c) Compute a solution $h$ to Poisson’s equation, with forcing function $\Delta$:

$$Ph = h - \Delta + m$$


6.9 Consider the following generalization of the M/M/1 queue on the non-negative integers $\mathbb{Z}_+ = \{0,1,2,\dots\}$.
The transition matrix satisfies $P(n,m) = 0$ if $|n - m| > 1$. Equivalently, for the
Markov chain $X$,

$$|X(k+1) - X(k)| \le 1 \quad \text{for each } X(0) \text{ and } k \ge 1$$

This is called a birth-death process.

Verify that the detailed balance equations hold, whenever an invariant pmf $\pi$ exists. That is,

$$\pi(i)P(i,j) = \pi(j)P(j,i)$$

Can you obtain an expression for $\pi$, similar to the M/M/1 queue?
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6.10 Consider the two dimensional linear state space model, 
X(k+1) = FX(k)+GN(k +1) 


where $F$ is a $2 \times 2$ matrix, with eigenvalues satisfying $|\lambda| < 1$, and $G \in \mathbb{R}^2$. The disturbance $N$ is
i.i.d. on $\mathbb{R}$ with zero mean and finite variance. We have seen that $X(k)$ converges in distribution
to the random variable

$$X_\infty = \sum_{i=0}^\infty F^i G N(i)$$

Convergence in distribution means that for any bounded and continuous function $g\colon \mathbb{R}^2 \to \mathbb{R}$,

$$\lim_{k\to\infty} \mathsf{E}[g(X(k))] = \mathsf{E}[g(X_\infty)]$$

This convergence holds for any given $X(0) \in \mathbb{R}^2$. Moreover, if $N$ is Gaussian, then so is $X_\infty$, with
covariance

$$\Sigma_{X_\infty} = \sigma_N^2 \sum_{i=0}^\infty F^i G G^\intercal (F^i)^\intercal$$

Find an example in which $\Sigma_{X_\infty}$ is rank one, choose your distribution for $N$ so that $\sigma_N^2 = 1$, and
proceed:
(a) Find a (discontinuous) function $g\colon \mathbb{R}^2 \to \mathbb{R}$ and an initial condition $X(0)$ for which $\mathsf{E}[g(X(k))] = 0$
for all $k$, yet $\mathsf{E}[g(X_\infty)] = 1$ (ergodicity fails).
(b) Solve Poisson’s equation $PV = V - c + \eta$ with $c(x) = \|x\|^2$.
The remaining two parts are based on simulation for this example. 


(c) Average $\{c(X(k)) : 1 \le k \le T\}$ and observe that the resulting average does approximate $\eta$. You might
plot this as a function of $T$ (the range of $T$ should be much, much larger than $\mathrm{trace}(\Sigma_{X_\infty})$).

(d) Observe the evolution of $X$ on $\mathbb{R}^2$. Choose two initial conditions: one proportional to $G$, and
one satisfying $X(0)^\intercal G = 0$. Also choose $\|X(0)\| \gg \mathrm{trace}(\Sigma_{X_\infty})$ (say, 10 times larger). Provide
plots of $X$ along with discussion.


6.11 Criteria for instability. Let $X$ be an irreducible Markov chain on the non-negative integers
$\mathsf{X} = \{0,1,2,\dots\}$. That is, the Markov chain is $x^*$-irreducible for every $x^* \in \mathsf{X}$.

Assume that there is a non-negative function $V\colon \mathsf{X} \to \mathbb{R}_+$ satisfying $PV \le V$ (such functions are
known as super-harmonic).

(a) Verify that $M(k) = V(X(k))$ satisfies the super-martingale property:

$$\mathsf{E}[M(k+1) \mid M(0),\dots,M(k)] \le M(k)$$

Suggestion: first condition on $X(0),\dots,X(k)$, and then apply the smoothing property of conditional
expectation. If this language is not familiar, then skip to (b).

A useful fact about non-negative super-martingales is that they are convergent:

$$M_\infty \mathbin{:=} \lim_{k\to\infty} M(k)$$

where the limit exists with probability one, but may take on infinite values.

(c) If $V$ is not identically constant, show that the Markov chain is transient. For this, you must
show that there are two states $x$ and $y$ such that $\mathsf{P}\{\tau_y = \infty \mid X(0) = x\} > 0$.
6.12 Consider the M/M/1 queue, with load $\rho = \alpha/\mu > 1$, so that the system is not stable. Show
that the function $V$ of Exercise 6.11 defined by $V(x) = \rho^{-x}$ is super-harmonic.
6.13 Simulation theory and practice. Consider the Markov chain with transition matrix (6.21).

(a) Obtain the invariant pmf $\pi$, and the solution to Poisson’s equation $h$ with $c(x) = x$. For this
you might review Perron-Frobenius theory in Section 3.2.3 (see also Exercise 6.20).

(b) Compute $\sigma^2_{\text{CLT}}$ based on $\pi, h, c$ (review Section 6.7.1).

(c) Test the predictive value of $\sigma^2_{\text{CLT}}$ through the batch means methods described in Section 6.7:

Run $M = 500$ (or more) independent simulations to obtain the $M$ estimates $\{\eta_N^i : 1 \le i \le M\}$
defined in (6.41a). Based on this data, obtain an approximation $\hat\sigma^2_{\text{CLT}}(N)$ based on a histogram of
$\{\sqrt{N}(\eta_N^i - \eta) : 1 \le i \le M\}$.
Repeat for $N = 10^m$ for $m = 2,3,4,\dots$ (stop when your computer power can’t keep up).
Discuss your findings. In particular, does your estimate for smaller values of $N$ provide insight into
how long you must simulate to obtain a good estimate of $\eta = \pi(c)$?
6.14 Simulation theory and practice. We return to the example in Exercise 6.5.
(a) If you haven’t done so already, obtain the solution to Exercise 6.5, and review Section 6.6 to
understand why $\pi(c) \le \bar\eta_s$.
(b) Solve Poisson’s equation numerically for several values of $\delta$ and $s$, and plot your result.

(c) Estimate $\eta = \pi(c)$ through simulation for at least five values of $\delta$ (as small as 0.01 and as
large as 0.5), and several values of $s$. Perform multiple independent runs to obtain estimates of the
error in your estimates.

Suggestions and warnings Review Section 6.7.5 before you begin, especially the value of common
random numbers. Keep in mind that the dynamics are similar to a queueing model, in that
the mean drift

$$\mathsf{E}[X(k+1) - X(k) \mid X(k) = x] = F(x)$$

is nearly constant (independent of $x$) when $|x|$ is large. The steady state mean $\pi(c)$ grows as $\delta^{-1}$,
and the asymptotic variance $\sigma^2_{\text{CLT}}$ grows as $\delta^{-4}$ for vanishing $\delta$.


6.15 The M/M/1 queue has a geometric pmf with parameter $\rho$. Suppose that we wish to estimate
the autocorrelation via

$$\frac{1}{N}\sum_{k=0}^{N-1} X(k)X(k+1)$$

so that in the notation of (6.47) we have $L(X(k), X(k+1)) = X(k)X(k+1)$.
(a) Compute or bound the asymptotic covariance of this estimator; it will be huge for $\rho \approx 1$!

(b) Obtain a formula for the asymptotic covariance using split sampling, as defined in (6.46).
This is easy, since the samples are i.i.d., and the marginal of $X$ is geometric.

How do the asymptotic variances compare?
Sensitivity: 


6.16 This exercise provides an exploration of Schweitzer’s formula (6.52) for the CRW queue

$$X(k+1) = X(k) - U(k) + A(k+1)$$

in which there is a cost for control, and cost for queueing delay. A randomized policy is denoted

$$\check\phi^\theta(1 \mid x) \mathbin{:=} \mathsf{P}\{U(k) = 1 \mid X_0^k, U_0^{k-1}, A_0^k\}, \qquad X(k) = x.$$

Here we focus on a simple special case in which $\check\phi^\theta(1 \mid x) = \theta\,\mathbf{1}\{x \ge 1\}$, with $\theta \in [0,1]$, and consider
the cost function $c_\theta(x) = x + c_2\theta^p$, with $c_2 > 0$ and $p > 1$.

(a) Obtain an expression for $\frac{d}{d\theta} P_\theta g$, for an arbitrary function $g\colon \mathsf{X} \to \mathbb{R}$.

(b) Compute $\eta_\theta$ and the solution to Poisson’s equation $h_\theta$ with forcing function $c_\theta$. Compute $\eta_\theta'$ by
directly differentiating your formula for $\eta_\theta$. Verify that your answer is consistent with Schweitzer’s
formula.

(c) Plot $\eta_\theta$ for $\theta \in (0,1)$ to find the best policy. Use your preferred value of $c_2 > 1$, $p > 1$, and
take $\alpha = \mathsf{E}[A(t)] = 5$.


6.17 Consider the one-dimensional linear state space model

$$X(k+1) = X(k) + U(k) + N(k+1), \qquad U(k) = -\theta X(k)$$

in which $N$ is i.i.d., scalar valued, and with marginal $N(0,1)$. Consider the quadratic cost $c(x) = x^2$,
and let $\eta_\theta$ denote the steady-state average cost. Our goal is to minimize the loss function

$$\Gamma(\theta) = \lim_{k\to\infty} \mathsf{E}[X(k)^2 + U(k)^2] = \eta_\theta[1 + \theta^2]$$

(a) Obtain an expression for $\Gamma(\theta)$, including the range of $\theta \in \mathbb{R}$ for which it is finite, and a formula
for the minimizer $\theta^*$. Is $\Gamma$ convex?

(b) Obtain an expression for $\nabla\Gamma(\theta)$ using Thm. 6.8, and verify that it agrees with (a). In particular,
verify that $\nabla\Gamma(\theta) = 0$ when $\theta = \theta^*$.


6.18 Consider the generalization of Exercise 6.17 to the linear model with $n$-dimensional state
process and scalar input:

$$X(k+1) = FX(k) + GU(k) + N(k+1), \qquad U(k) = -K_\theta X(k)$$

in which $N$ is i.i.d., with marginal $N(0, \Sigma_N)$. Assume a linear parameterization for the gain:
$K_\theta = \sum_i \theta_i K^i$ for $\theta \in \mathbb{R}^d$, where each $K^i$ is $n \times 1$.
(a) Obtain an expression for the score function $S^\theta(x,y)$ for the special case (6.55).

(b) Find an expression for $h_\theta$ with forcing function $c(x) = \|x\|^2$, of the form $h_\theta(x) = x^\intercal M_\theta x$.
Specification of $M_\theta > 0$ will involve a Lyapunov equation.

(c) Obtain an expression for $\nabla\Gamma(\theta)$ with

$$\Gamma(\theta) \mathbin{:=} \mathsf{E}_{\pi_\theta}[c(X(k)) + rU(k)^2], \qquad U(k) = -K_\theta X(k)$$

for $r > 0$ fixed and where the expectation is in steady-state.


Perron-Frobenius theory: Section 3.2.3 contains a brief introduction to Perron-Frobenius theory.
Applications to finite state-space Markov chains are surveyed in the exercises that follow. These
concepts provide useful computational tools and insight.


In each of the problems that follow we consider a Markov chain X on a countable state space X, 
with transition matrix P. 


Pre-publication draft -- March 25, 2022 




6.19 An introduction. Suppose that $P$ is “rank one”. This means that there exists a function $s$
and a pmf $\nu$ on $\mathsf{X}$ satisfying $P = s \otimes \nu$ (that is, $P(x,y) = s(x)\nu(y)$ for each $x, y \in \mathsf{X}$). Show that
$s \equiv 1$, $\nu$ is an invariant pmf, and hence $X$ is i.i.d. with marginal distribution $\nu$.


6.20 Perron-Frobenius theory of positive matrices. Suppose that $P$ is not rank one, but rather
dominates a rank-one matrix: there exists a non-negative function $s$ (not identically zero) and a
pmf $\nu$ on $\mathsf{X}$ satisfying

$$P(x,y) \ge s(x)\nu(y) \quad \text{for each } x, y \in \mathsf{X} \qquad \text{(PF)}$$

The following power series expansion always exists:

$$G = \sum_{n=0}^\infty (P - s \otimes \nu)^n$$

and satisfies $\nu G s \le 1$. This is the starting point of Perron-Frobenius theory. Hence
$G = [I - (P - s \otimes \nu)]^{-1}$ whenever the inverse exists.

(a) If $\pi$ is an invariant pmf, so that $\pi P = \pi$, it then follows:

$$\pi[I - P + s \otimes \nu] = \delta\nu$$

where $\delta \ge 0$. Provide a formula for $\delta$ in terms of $\pi$, $s$ and $\nu$.

Argue that $\mu = \nu G$ is a left eigenvector of $P$, provided the inverse $G = [I - (P - s \otimes \nu)]^{-1}$ exists,
and $\delta > 0$.

(b) Postulate on a representation for a solution to Poisson’s equation $Ph = h - \tilde c$, based on the
matrix $G$.
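
The construction in Exercise 6.20 can be explored numerically. The sketch below is an illustration under stated assumptions, not the book's code: the $3 \times 3$ matrix `P` is hypothetical, $\nu$ is taken to be uniform, and $s(x) = \min_y P(x,y)/\nu(y)$ is one convenient choice that satisfies (PF). The power series for $G$ is truncated after a fixed number of terms, and the helper names are illustrative.

```python
# Hypothetical transition matrix with strictly positive entries.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
n = len(P)

nu = [1.0 / n] * n                                   # pmf in the minorization
s = [min(P[x][y] / nu[y] for y in range(n)) for x in range(n)]


def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def resolvent(num_terms=500):
    """Truncated power series G = sum_{m >= 0} (P - s (x) nu)^m."""
    K = [[P[x][y] - s[x] * nu[y] for y in range(n)] for x in range(n)]
    G = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in G]
    for _ in range(num_terms):
        term = matmul(term, K)
        G = [[G[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return G


G = resolvent()
mu = [sum(nu[x] * G[x][y] for x in range(n)) for y in range(n)]   # mu = nu G
pi = [m / sum(mu) for m in mu]                                    # normalize

residual = max(abs(sum(pi[x] * P[x][y] for x in range(n)) - pi[y])
               for y in range(n))
print("pi =", pi, " invariance residual =", residual)
```

The truncation is harmless here because every row sum of $P - s \otimes \nu$ is at most $1 - \min_x s(x) < 1$, so the series converges geometrically.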


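6.21 PF theory and sensitivity. We now revisit the sensitivity formula in Thm. 6.8. You will
obtain an expression for the gradient of $h_\theta$ at a particular value $\theta^0 \in \mathbb{R}^d$, subject to these local
assumptions: for some $\varepsilon > 0$

▷ $P_\theta(x,x')$ and $c_\theta(x)$ are continuously differentiable for $\theta$ in the region $B_\varepsilon = \{\theta : \|\theta - \theta^0\| \le \varepsilon\}$.
▷ The Markov chain with transition matrix $P_\theta$ is uni-chain for each $\theta \in B_\varepsilon$, and satisfies the
uniform minorization condition for $\theta$ in this domain:

$$P_\theta(x,x') \ge s(x)\nu(x'), \qquad x, x' \in \mathsf{X}$$

where $\nu$ is a pmf, and $\pi_\theta(s) > 0$.
Obtain a formula for $\nabla h_\theta$, using $h_\theta = G_\theta \tilde c_\theta$, where

$$G_\theta = [I - P_\theta + s \otimes \nu]^{-1}$$

The following formula will be useful:

$$\frac{\partial}{\partial\theta_i} G_\theta = G_\theta \Bigl[\frac{\partial}{\partial\theta_i} P_\theta\Bigr] G_\theta$$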


6.22 Risk sensitive control [374]. Another application of PF theory is to infinite-horizon optimal
control with the risk sensitive criterion.
(a) The log moment generating function associated with a real-valued random variable $\Xi$ is
$\Lambda(r) = \log \mathsf{E}[\exp(r\Xi)]$, with $r \in \mathbb{R}$ a variable. Assume that $\Lambda$ is finite valued in a neighborhood
of the origin. Obtain the first and second derivatives of $\Lambda$ at the origin, to justify the “small $r$
approximation”:

$$\Lambda(r) = m_\Xi r + \tfrac{1}{2}\sigma_\Xi^2 r^2 + O(r^3),$$

where the coefficients are the mean and variance of $\Xi$.

(b) One formulation of risk sensitive control is motivated by this Taylor series approximation,
with $r > 0$ and

$$\Xi = \sum_{k=0}^{\mathcal{N}-1} c(X(k)) + V_0(X(\mathcal{N}))$$

where $\mathcal{N} \ge 1$, $c$ is a cost function, and $V_0$ is the terminal cost. Consider the finite state space
Markov chain with $\mathsf{X} = \{1,\dots,n\}$, introduce an $n \times n$ matrix with entries

$$R(i,j) = \exp(r c(i)) P(i,j)$$

and let $(v, \lambda)$ denote a solution to the eigenvector equation $Rv = \lambda v$. Show that for each $i \in \mathsf{X}$,

$$\lambda^{\mathcal{N}} v_i = \sum_j R^{\mathcal{N}}(i,j)\, v_j = \mathsf{E}\Bigl[\exp\Bigl(r \sum_{k=0}^{\mathcal{N}-1} c(X(k)) + V(X(\mathcal{N}))\Bigr) \Bigm| X(0) = i\Bigr]$$

where $V(i) = \log(v_i)$ for each $i$. Conclude that $\Lambda = \log(\lambda)$ is the infinite-horizon risk sensitive cost,
defined by (using the notation (6.9))

$$\Lambda(r) = \lim_{\mathcal{N}\to\infty} \frac{1}{\mathcal{N}} \log\Bigl\{\mathsf{E}_x\Bigl[\exp\Bigl(r \sum_{k=0}^{\mathcal{N}-1} c(X(k))\Bigr)\Bigr]\Bigr\}, \qquad X(0) = x \in \mathsf{X}$$

(c) Solve the eigenvalue equation $Rv = \lambda v$ for the four-state Markov chain with transition matrix
(6.21), using $c(i) = i$, and verify the following identities

$$\frac{d}{dr}\Lambda(r)\Big|_{r=0} = \eta = \pi(c), \qquad \frac{d}{dr}\log\bigl(v_i(r)/v_1(r)\bigr)\Big|_{r=0} = h_i$$

where $h$ is the unique solution to Poisson’s equation satisfying $h_1 = 0$.


6.11 Notes 


As in [257], in this book the word “chain” indicates that time is discrete, and the term Markov 
process is reserved for models in continuous time. The terminology is in honor of the Russian 
mathematician Andrey Markov. Good introductory texts include [275, 138], and for more advanced
material see [117, 257]. Some of the material in this chapter and Appendix B is adapted from [254,
Chapters 8 & 9] and its appendix. Standard texts on Perron-Frobenius theory are [315, 277], and
the appendix of [254] contains a survey aimed at readers with interests in control applications. 

The reference [257] remains an up-to-date source for Lyapunov theory and Poisson’s equation
for Markov chains on a general state space. The book also contains spectral theory in the general 
state space setting, but this remains a rapidly evolving field [117, 196, 378, 197]. 
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An encyclopedia for theory and practice of simulation is [12]. See [194, 195, 253, 59] for special- 
ized theory on the CLT and other asymptotic statistics for Markov chains, written with attention 
to the applications of interest in this book. 

It was first demonstrated in [253, Prop. 1.1] that the finite-n bound (6.43) cannot be obtained 
for the M/M/1 queue and other “skip-free” Markov chains on a countable state space. Much more 
on this topic can be found in [254, Chapter 11] (along with details on the histogram Fig. 6.6), and 
[120, 119] for a complete explanation of Ii (€) appearing in Section 8.2.1.

Techniques to obtain finite-n bounds through the introduction of control variates can be found 
in [194, 253]. See [158, 159, 157] for applications of control variate techniques to network simulation, 
and Section 10.7 for applications in RL. 

Schweitzer introduced his sensitivity formula in his doctoral thesis, which was subsequently 
published in 1968 [313, 251]. See Konda’s thesis for the early history of gradient-free optimization for
application in RL [188, 191], and Section 10.10 of this book for more history.
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Stochastic Control 


By stochastic control we mean that random disturbances and measurement noise are accounted for 
explicitly in our control system model. A Markov Decision Process (MDP) is a very special case.
While the theory in this part of the book will focus almost exclusively on MDPs, the title of the 
chapter is meant to emphasize our goals rather than a particular set of techniques to achieve them. 

Before we dive straight into theory, it will be helpful (and perhaps more interesting) to first 
survey examples to explain the similarities and differences between deterministic and stochastic
control. A short primer on MDP theory is contained in Appendix B. 

The object of study in MDP theory is a state process X = {X(k) : k > 0} that takes values in 
a state space X. The evolution of X is influenced by disturbances, as in the nonlinear state space 
model (6.3), and also a control sequence U = {U(k) : k > 0} taking values in an input (or action) 
space U. As in deterministic control, our objective is to choose U(k) for every k > 0, based on 
observations, so that the system behaves as desired—the meaning depends on the application, as 
we will see in the examples that follow. 


7.1 MDPs: A Quick Introduction


In this short introduction we assume that the state and input spaces are finite, or countably infinite. 
This greatly simplifies statements of the definitions, as in the uncontrolled setting. 

The first ingredient in an MDP is the controlled transition matrix: subject to an “admissibility”
assumption defined below, the pair $(X(k), U(k))$ is a sufficient statistic in the following sense:

$$\mathsf{P}\{X(k+1) = x' \mid X(0),\dots,X(k),\ U(0),\dots,U(k);\ X(k) = x,\ U(k) = u\} = P_u(x,x')$$

where the right hand side is the controlled transition matrix, evaluated at the triple $(u, x, x')$. More
generally, for any $m, k \ge 0$ and function $g\colon \mathsf{X}^{m+1} \to \mathbb{R}$,

$$\mathsf{E}[g(X(k),X(k+1),\dots,X(k+m)) \mid X(0),\dots,X(k),\ U(0),\dots,U(k)]$$
$$\qquad = \mathsf{E}[g(X(k),X(k+1),\dots,X(k+m)) \mid X(k),\ U(k)]$$


The definition is far less abstract when we have a realization of the controlled Markov model:

$$X(k+1) = F(X(k), U(k), N(k+1)) \qquad (7.1)$$

where $N$ is an i.i.d. sequence. The controlled transition matrix has the explicit form

$$P_u(x,x') = \mathsf{P}\{F(x,u,N(1)) = x'\}, \qquad u \in \mathsf{U},\ x, x' \in \mathsf{X}$$

Markov Decision Processes (MDP). The definition requires three ingredients:

(i) The controlled transition matrix, denoted $P_u(x,x')$ for $x, x' \in \mathsf{X}$, $u \in \mathsf{U}$.

(ii) A cost function $c\colon \mathsf{X} \times \mathsf{U} \to \mathbb{R}$, and input constraints represented by a set $\mathsf{U}(x) \subset \mathsf{U}$ for
each $x \in \mathsf{X}$.

(iii) An objective, such as total cost:

$$h(x) = \sum_{k=0}^\infty \mathsf{E}_x[c(X(k),U(k))]$$

or average cost:

$$\eta = \lim_{n\to\infty} \frac{1}{n} \sum_{k=0}^{n-1} \mathsf{E}_x[c(X(k),U(k))] \qquad (7.2)$$

Once the objective is defined, the goal is to minimize over all “admissible” input sequences.


As in the deterministic control setting, stationary policies play an important role. For any such
policy $\phi\colon \mathsf{X} \to \mathsf{U}$, if $U(k) = \phi(X(k))$ for each $k$, then the controlled process $X$ is a Markov chain
with transition matrix denoted $P_\phi$:

$$P_\phi(x,x') = P_u(x,x')\big|_{u=\phi(x)}, \qquad x, x' \in \mathsf{X} \qquad (7.3)$$

Also similar to deterministic control is the role of DP equations. For the total cost criterion, we
have for each $x$,

$$h^*(x) = \min_u \Bigl\{c(x,u) + \sum_{x'} P_u(x,x')\,h^*(x')\Bigr\} \qquad (7.4)$$

and the optimal policy is state feedback $\phi^*$, in which $\phi^*(x)$ is any minimizer of (7.4), for each $x$.
Minimization of the average cost criterion (7.2) revolves around a very similar DP equation,
whose origins are explained in Appendix B. Under mild assumptions, the minimal average cost
$\eta^*$ does not depend on the initial condition, and there is a solution to the average cost optimality
equation (ACOE):

$$h^*(x) + \eta^* = \min_u \Bigl\{c(x,u) + \sum_{x'} P_u(x,x')\,h^*(x')\Bigr\} \qquad (7.5)$$

The function $h^*$ is known as the relative value function, and the optimal policy is again any
minimizer:

$$\phi^*(x) \in \arg\min_u \Bigl\{c(x,u) + \sum_{x'} P_u(x,x')\,h^*(x')\Bigr\}, \qquad x \in \mathsf{X} \qquad (7.6)$$

The “Q-function” of Q-learning is the function within the brackets:

$$Q^*(x,u) = c(x,u) + \sum_{x'} P_u(x,x')\,h^*(x'), \qquad x \in \mathsf{X},\ u \in \mathsf{U} \qquad (7.7)$$
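
To make the average cost optimality equation concrete, the following sketch applies relative value iteration to a small MDP. Everything specific here is an assumption made for illustration: the three-state, two-action model, the cost $c(x,u) = x + 0.8u$, the reference state 0 used for normalization, and the helper names `q_values` and `relative_value_iteration`. Relative value iteration is a standard technique swapped in for the purpose of the example; it is not claimed to be the algorithm used in this book.

```python
# Hypothetical MDP: states 0, 1, 2 and actions 0 (slow), 1 (fast).
P = {
    0: [[0.7, 0.3, 0.0], [0.2, 0.5, 0.3], [0.0, 0.2, 0.8]],
    1: [[0.7, 0.3, 0.0], [0.6, 0.3, 0.1], [0.0, 0.6, 0.4]],
}
states, actions = range(3), (0, 1)


def cost(x, u):
    """Illustrative cost: holding cost x plus a charge for the fast action."""
    return x + 0.8 * u


def q_values(h):
    """Q(x,u) = c(x,u) + sum_x' P_u(x,x') h(x') for every pair (x,u)."""
    return [[cost(x, u) + sum(P[u][x][y] * h[y] for y in states)
             for u in actions] for x in states]


def relative_value_iteration(tol=1e-12, max_iter=10000):
    """Iterate h <- min_u Q(., u) - eta, normalized so that h(0) = 0."""
    h, eta = [0.0] * len(states), 0.0
    for _ in range(max_iter):
        Th = [min(row) for row in q_values(h)]
        eta = Th[0]                       # value at the reference state
        h_new = [v - eta for v in Th]
        if max(abs(a - b) for a, b in zip(h, h_new)) < tol:
            return h_new, eta
        h = h_new
    return h, eta


h_star, eta_star = relative_value_iteration()
Q_star = q_values(h_star)
policy = [min(actions, key=lambda u: Q_star[x][u]) for x in states]
acoe_residual = max(abs(min(Q_star[x]) - h_star[x] - eta_star) for x in states)
print("eta* =", eta_star, " h* =", h_star, " policy =", policy)
```

The quantity `acoe_residual` measures how closely the computed pair satisfies (7.5), and `policy` is a minimizer in the sense of (7.6) for this particular example.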


Average cost and transient performance bounds. The message from Section 6.6.2 deserves
repeating here: minimizing the average cost (7.2) does not mean we don’t care about
the present.
Thm. 6.12 tells us that we can expect to find a solution to the ACOE for which $h^*$ takes
on non-negative values, even for a general state space. Prop. 6.4 then applies with $V = h^*$ to
obtain under the optimal policy,

$$\frac{1}{n} \sum_{k=0}^{n-1} \mathsf{E}[c(X^*(k), U^*(k)) \mid X(0) = x] \le \eta^* + \frac{1}{n} h^*(x), \qquad n \ge 1,\ x \in \mathsf{X}$$

For a finite state space, geometric ergodicity follows from aperiodicity, from which we obtain
$\mathsf{E}[c(X^*(k), U^*(k))] = \eta^* + \varepsilon_k$, with $\varepsilon_k \to 0$ geometrically quickly.


7.1.1 Admissible Inputs 


Before optimizing it is necessary to first specify the class of allowable inputs. We already impose
a hard constraint that $U(k) \in \mathsf{U}$ for each $k$, and occasionally have state-dependent constraints:
$U(k) \in \mathsf{U}(x)$ when $X(k) = x$. We also impose causality, and to make this precise we require a bit
more notation.

We could settle for the following definition: we only allow input sequences of the form

$$U(k) = \phi_k(X(0),\dots,X(k)), \qquad k \ge 0 \qquad (7.8)$$

where $\phi_k\colon \mathsf{X}^{k+1} \to \mathsf{U}$ for each $k$ (perhaps also subject to finer state-dependent constraints). However,
theory requires that we sometimes allow randomized policies for which $U(k)$ depends on present
and past state values, along with independent “noise” included for exploration, as discussed in
Section 5.2.


Admissible Input. Assumed given: a sequence of independent and identically distributed
random variables $\xi$, evolving on a countable set $\Omega$. The sequence is exogenous to the control
system in the following sense: for any input sequence of the form (7.8),

$$\mathsf{P}\{X(k+1) = x',\ \xi(k+1) = \omega' \mid \Psi(0),\dots,\Psi(k),\ U(0),\dots,U(k);\ X(k) = x,\ U(k) = u\} = P_u(x,x')\,\nu(\omega') \qquad (7.9)$$

for all $u \in \mathsf{U}$, $x, x' \in \mathsf{X}$, $\omega' \in \Omega$, where $\Psi(k) = (X(k), \xi(k))$, and $\nu$ is the pmf for $\xi(k)$ (independent of $k$ by assumption).

We then say that an input sequence $U$ is admissible if it is a causal function of the joint
process:

$$U(k) = \phi_k(\Psi(0),\dots,\Psi(k)), \qquad k \ge 0 \qquad (7.10)$$

We have essentially enlarged the state space to $\mathsf{X} \times \Omega$, with new state process $\Psi$. For any
admissible input,

$$\mathsf{P}\{\Psi(k+1) = (x',\omega') \mid \Psi(0),\dots,\Psi(k),\ U(0),\dots,U(k);\ \Psi(k) = (x,\omega),\ U(k) = u\} = P_u(x,x')\,\nu(\omega')$$

The right hand side is the controlled transition matrix for $\Psi$.
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Notation and Conventions The standard MDP terminology inserts ‘Markov’ in the definition 
of a stationary policy: 


A Markov policy: $U(k) = \phi_k(X(k))$ for a sequence of maps $\phi_k\colon \mathsf{X} \to \mathsf{U}$, $k \ge 0$.
A Stationary Markov policy: $U(k) = \phi(X(k))$.

We will usually opt for simply ‘stationary policy’ or even ‘policy’ for $\phi$, which is also synonymous
with the term feedback law introduced in Chapter 2.

These definitions extend to randomized policies by substituting $\Psi$ for $X$. However, it is cumbersome
to introduce $\xi$ whenever we require randomization. We instead let $\check\phi$ denote a randomized
stationary policy, which is defined as a conditional probability: for each $u$ and $x$,

$$\mathsf{P}\{U(k) = u \mid X(0),\dots,X(k-1),\ U(0),\dots,U(k-1);\ X(k) = x\} = \check\phi(u \mid x) \qquad (7.11)$$

Prop. 7.1 then follows from these definitions:

Proposition 7.1. When the input sequence $U$ is defined by a stationary (possibly randomized)
policy $\check\phi$, it follows that the state process $X$ is a Markov chain, with transition matrix

$$P_{\check\phi}(x,x') = \sum_u \check\phi(u \mid x)\,P_u(x,x'), \qquad x, x' \in \mathsf{X} \qquad (7.12)$$

$\square$
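
The formula in Prop. 7.1 amounts to averaging the controlled transition matrices row by row. The sketch below illustrates this on a hypothetical two-action, three-state model; the arrays `P_u` and `phi` and the helper names `closed_loop` and `apply` are assumptions made only for the example.

```python
# Hypothetical controlled transition matrices, indexed by action u.
P_u = {
    0: [[0.7, 0.3, 0.0], [0.2, 0.5, 0.3], [0.0, 0.2, 0.8]],
    1: [[0.7, 0.3, 0.0], [0.6, 0.3, 0.1], [0.0, 0.6, 0.4]],
}

# Randomized stationary policy: phi[x][u] is the probability of action u in state x.
phi = [[0.5, 0.5],
       [0.2, 0.8],
       [0.0, 1.0]]


def closed_loop(P_u, phi):
    """Transition matrix of the controlled chain: sum_u phi(u|x) P_u(x, x')."""
    n = len(phi)
    return [[sum(phi[x][u] * P_u[u][x][y] for u in P_u) for y in range(n)]
            for x in range(n)]


def apply(P, h):
    """Compute the function Ph, with Ph(x) = sum_x' P(x, x') h(x')."""
    return [sum(p * hy for p, hy in zip(row, h)) for row in P]


P_phi = closed_loop(P_u, phi)
h = [0.0, 1.0, 4.0]                  # an arbitrary test function on the state space
P_phi_h = apply(P_phi, h)
print("P_phi =", P_phi)
print("P_phi h =", P_phi_h)
```

A deterministic stationary policy is recovered as the special case in which each row of `phi` places probability one on a single action.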


For a function $h\colon \mathsf{X} \to \mathbb{R}$ we use the following compact notation for conditional expectations:

$$P_u h\,(x) \mathbin{:=} \sum_{x'} P_u(x,x')\,h(x') = \mathsf{E}[h(X(k+1)) \mid X(k) = x,\ U(k) = u]$$

$$P_{\check\phi} h\,(x) \mathbin{:=} \sum_{u,x'} \check\phi(u \mid x)\,P_u(x,x')\,h(x') = \mathsf{E}[h(X(k+1)) \mid X(k) = x]$$

where $X$ is controlled using policy $\check\phi$ in the second definition.


For those of you who know something about stochastic processes The expression (7.9)
is an example of an equation that is begging for streamlined notation. It is common to introduce
a non-decreasing sequence of $\sigma$-algebras (a filtration) as a way to model the history appearing in
this equation:

$$\mathcal{F}_k = \sigma\{\Psi(0),\dots,\Psi(k),\ U(0),\dots,U(k)\} \qquad (7.14)$$

If the “$\sigma$” looks foreign to you, then simply take $\mathcal{F}_k$ as a notational convention.
We have under any admissible input

$$\mathsf{P}\{X(k+1) = x' \mid \mathcal{F}_k;\ X(k) = x,\ U(k) = u\} = P_u(x,x') \qquad (7.15a)$$

and for any function $h\colon \mathsf{X} \to \mathbb{R}$,

$$\mathsf{E}[h(X(k+1)) \mid \mathcal{F}_k;\ X(k) = x,\ U(k) = u] = \sum_{x'} P_u(x,x')\,h(x') \qquad (7.15b)$$

which is also expressed using the matrix notation analogous to (6.9b),

$$P_u h\,(x) = \sum_{x'} P_u(x,x')\,h(x') \qquad (7.15c)$$
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7.2 Fluid Models for Approximation 


It is often best to initially ignore disturbances, as a means to obtain intuition regarding a good
policy. Many of the examples in this chapter are designed to illustrate this point.
One approach is to consider the averaged dynamics associated with (7.1):

$$\bar F(x,u) \mathbin{:=} \mathsf{E}[F(x,u,N(k+1))], \qquad x \in \mathsf{X},\ u \in \mathsf{U}(x) \qquad (7.16)$$

which is independent of $k$ since $N$ is i.i.d. The associated fluid model is the deterministic state
space model

$$x(k+1) = \bar F(x(k), u(k)) \qquad (7.17)$$

It is sometimes simpler to introduce a model in continuous time, with vector field

$$\bar f(x,u) \mathbin{:=} \bar F(x,u) - x, \qquad x \in \mathsf{X},\ u \in \mathsf{U}(x) \qquad (7.18)$$

so that (7.17) is expressed as

$$x(k+1) - x(k) = \bar f(x(k), u(k))$$

and this is then approximated by the nonlinear state space model in continuous time:

$$\tfrac{d}{dt} x_t = \bar f(x_t, u_t) \qquad (7.19)$$


These deterministic systems appear in approximating a large number of interacting stochastic 
systems, where they are called mean-field models. We will sometimes use this language here, even 
when we are only interested in a single system in isolation. 
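
The relationship between a stochastic model and its fluid approximation can be seen in a short simulation. The sketch below is an illustration under assumptions that do not come from the text: a single queue with Bernoulli arrivals of hypothetical rate `ALPHA`, the non-idling policy $u = \mathbf{1}\{x \ge 1\}$, and an initial condition far from the origin. The function names are illustrative.

```python
import random

ALPHA = 0.4        # hypothetical arrival rate, E[N(k)] = ALPHA


def F(x, u, n):
    """One step of the stochastic model (7.1) for a single queue."""
    return x - u + n


def F_bar(x, u):
    """Averaged dynamics (7.16): the expectation of F(x, u, N) over N."""
    return x - u + ALPHA


def policy(x):
    """Non-idling policy: serve whenever at least one unit is present."""
    return 1 if x >= 1 else 0


def simulate(x0, horizon, rng):
    """Sample X(horizon) for the stochastic model started at x0."""
    x = x0
    for _ in range(horizon):
        n = 1 if rng.random() < ALPHA else 0
        x = F(x, policy(x), n)
    return x


def fluid(x0, horizon):
    """State of the fluid model (7.17) at time `horizon`, started at x0."""
    x = float(x0)
    for _ in range(horizon):
        x = F_bar(x, policy(x))
    return x


rng = random.Random(0)
x0, horizon, runs = 50, 40, 2000
average = sum(simulate(x0, horizon, rng) for _ in range(runs)) / runs
print("fluid state:", fluid(x0, horizon), " Monte Carlo mean:", average)
```

For this initial condition the queue cannot empty within the chosen horizon, so the mean of the stochastic model coincides with the fluid trajectory; the two descriptions separate once the state approaches the boundary.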

For either of the approximations (7.17) or (7.19) to be meaningful, we require that the state 
space X be a convex subset of Euclidean space, and typically also approximate U by a convex set. 
This can bring challenges with notation: In [254], which concerns control of queueing networks, two 
models are considered side by side. The state space X, is used for a countable-state space model, 
and X denotes a convex state space for a deterministic fluid model. It is simplest here to allow the 
definition of X and U to change with the context. 


Deterministic and stochastic control aren’t so different. One justification of this claim is
provided in Prop. 9.6, but this result requires more background than is available to us at this stage
in the book. Justification is provided here based on comparison of DP equations for stochastic and
deterministic models.

Assume that the state and action spaces are convex subsets of Euclidean space, and consider
any function $J\colon X \to \mathbb{R}$ whose gradient $\nabla J$ is continuous. The Mean Value Theorem (MVT) tells
us that for any states $x, x' \in X$, there is a scalar $\varrho \in [0,1]$ such that

$$J(x') = J(x) + \nabla J(\bar x) \cdot \{x' - x\}, \qquad \bar x = \varrho x' + (1-\varrho) x \tag{7.20}$$


In the remainder of this subsection we show how to interpret the MVT with J = J* equal to 
the total-cost value function for either of the deterministic fluid models. It will be useful to impose 
a Lipschitz bound on the gradient: for a Lipschitz constant L > 0, 


$$\|\nabla J(x') - \nabla J(x)\| \le L \|x' - x\|, \qquad x, x' \in X \tag{7.21}$$
Fluid model in discrete time. Let’s first consider the total cost value function $J^*$ associated
with the (deterministic) fluid model (7.17). This satisfies the DP equation

$$J^*(x) = \min_u \{ c(x,u) + J^*(\bar F(x,u)) \} \tag{7.22}$$


Apply the MVT (7.20) using $x' = X(k+1)$ and, in the role of $x$, the conditional mean

$$\bar F(x,u) = E[X(k+1) \mid X(k) = x,\ U(k) = u]$$

This gives, for any continuously differentiable function J,

$$J(X(k+1)) = J(\bar F(x,u)) + \nabla J(\bar X) \cdot \tilde X(k+1) \tag{7.23}$$

with $\tilde X(k+1) \overset{\text{def}}{=} X(k+1) - \bar F(x,u)$, and where $\bar X = \varrho X(k+1) + (1-\varrho) \bar F(x,u)$, with $\varrho$ a random variable taking values in the interval
$[0,1]$.
Lemma 7.2 provides a link between deterministic and stochastic dynamic programming equations. Denote

$$\eta(k+1) = \{\nabla J(\bar X) - \nabla J(\bar F(x,u))\} \cdot \tilde X(k+1)$$

We have $\bar X - \bar F(x,u) = \varrho \tilde X(k+1)$, so that (7.21) implies the bound on $\eta$ given in (7.24a).


Lemma 7.2. Suppose that $J\colon X \to \mathbb{R}$ is differentiable, and its gradient satisfies the Lipschitz
bound (7.21). Then the following approximation holds for each k: letting $x = X(k)$ and $u = U(k)$,

$$\begin{aligned} J(X(k+1)) &= J(\bar F(x,u)) + \nabla J(\bar F(x,u)) \cdot \tilde X(k+1) + \eta(k+1) \\ |\eta(k+1)| &\le L \|\tilde X(k+1)\|^2 \end{aligned} \tag{7.24a}$$

where $\tilde X(k+1) = X(k+1) - \bar F(x,u)$. Consequently, for any x, u and k, in the notation of (7.15),

$$\begin{aligned} P_u J(x) &= J(\bar F(x,u)) + \bar\eta(x,u) \\ \bar\eta(x,u) &= E[\eta(k+1) \mid X(k) = x,\ U(k) = u] \\ &= E[J(X(k+1)) \mid X(k) = x,\ U(k) = u] - J(\bar F(x,u)) \end{aligned} \tag{7.24b}$$
□


The function $\bar\eta$ can be expressed in the more compact notation

$$\bar\eta(x,u) = P_u J(x) - J(\bar F(x,u))$$

Suppose that the function J is convex. Jensen’s inequality implies the bound

$$E[J(X(k+1)) \mid X(k) = x,\ U(k) = u] \ge J(E[X(k+1) \mid X(k) = x,\ U(k) = u]) = J(\bar F(x,u))$$

It follows from (7.24b) that $\bar\eta$ is a non-negative function of $(x,u)$ in this special case.

Consider application of the lemma with J = J*. Provided the value function satisfies the 
assumptions of the lemma, the DP equation (7.22) for the fluid model implies a DP equation for 
the MDP model: 


$$J^*(x) = \min_u \{ c(x,u) - \bar\eta(x,u) + P_u J^*(x) \}, \qquad x \in X \tag{7.25}$$
In many examples to follow we find that $\bar\eta$ is small relative to c, so that $J^*$ approximately solves
the ACOE. In fact, we do not need $\bar\eta$ itself to be small, only small in the span seminorm:

$$\|\bar\eta\|_{\mathrm{sp}} = \min_{\varrho} \max_{x,u} |\bar\eta(x,u) - \varrho|$$

Letting $\varrho^\circ$ denote the minimizer and $c^\circ(x,u) = c(x,u) - [\bar\eta(x,u) - \varrho^\circ]$, (7.25) becomes

$$\varrho^\circ + J^*(x) = \min_u \{ c^\circ(x,u) + P_u J^*(x) \} \tag{7.26}$$

Hence $J^*$ solves the ACOE with this cost function, and average cost $\varrho^\circ$.
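
As a brief illustration of (7.26), consider the special case in which the error term from Lemma 7.2 is constant. The constant $\eta_0$ below is notation introduced only for this sketch; it does not appear elsewhere in the text.

```latex
% Special case: \bar\eta(x,u) = \eta_0 for every pair (x,u).
\begin{align*}
\|\bar\eta\|_{\mathrm{sp}} &= \min_{\varrho}\max_{x,u}|\eta_0 - \varrho| = 0,
  \qquad \varrho^\circ = \eta_0, \\
c^\circ(x,u) &= c(x,u) - [\eta_0 - \eta_0] = c(x,u), \\
\eta_0 + J^*(x) &= \min_u \{\, c(x,u) + P_u J^*(x) \,\}.
\end{align*}
```

In this case the fluid value function solves the ACOE for the original cost function, with average cost $\eta_0$.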


Fluid model in continuous time. The value function $J^*$ for total cost in continuous time solves
the HJB equation:

$$0 = \min_u \{ c(x,u) + \bar f(x,u) \cdot \nabla J^*(x) \} \tag{7.27}$$

Justification of the continuous time model requires a different application of the MVT (7.20).
We take $x' = X(k+1)$ as before, but now $x = X(k)$ to obtain

$$J(X(k+1)) = J(X(k)) + \nabla J(\bar X) \cdot \{X(k+1) - X(k)\}$$

where $\bar X$ lies on the line segment with endpoints $X(k)$ and $X(k+1)$. Denote

$$\eta(k+1) = \{\nabla J(\bar X) - \nabla J(X(k))\} \cdot \{X(k+1) - X(k)\}$$


Lemma 7.3. Suppose that $J\colon X \to \mathbb{R}$ is differentiable, and its gradient satisfies the Lipschitz
bound (7.21). Then the following approximation holds for each k:

$$\begin{aligned} J(X(k+1)) &= J(X(k)) + \nabla J(X(k)) \cdot \{X(k+1) - X(k)\} + \eta(k+1) \\ |\eta(k+1)| &\le L \|X(k+1) - X(k)\|^2 \end{aligned} \tag{7.28a}$$

Consequently, for any x, u and k,

$$\begin{aligned} P_u J(x) - J(x) &= \nabla J(x) \cdot \bar f(x,u) + \bar\eta(x,u) \\ \bar\eta(x,u) &= E[\eta(k+1) \mid X(k) = x,\ U(k) = u] \end{aligned} \tag{7.28b}$$
□


Applying the lemma with J = J* solving (7.27) we again arrive at the representation (7.25), 
whenever J* satisfies the assumptions of Lemma 7.3. In practice, these two lemmas are useful as 
motivation to initially ignore disturbances in control design. Simple applications can be found in 
the remainder of this chapter: 


(i) The continuous time fluid model provides approximations for the ACOE in queueing networks. The single queue is considered next, and a generalization follows in Section 7.4. A
significant part of the monograph [254] is dedicated to these approximations. The strongest
theory relates stability of a Markov model and stability of its fluid model.


(ii) The discrete time fluid model predicts the solution to the ACOE exactly for the LQG model
considered in Section 7.5. This is because the function $\bar\eta(x,u)$ appearing in Lemma 7.2 does
not depend on $(x,u)$.
7.3 Queues


The reflected random walk, introduced in Example 6.2.2, is a common model for the evolution of 
workload in the single server queue. The CRW (controlled random walk) queue has the same form, 
but with the introduction of an input: 


$$X(k+1) = X(k) - S(k+1) U(k) + A(k+1) \tag{7.29}$$

where $N(k) \overset{\text{def}}{=} (S(k), A(k))$ is an i.i.d. sequence, and $S(k)$ has a Bernoulli distribution. It is assumed
that $X = \mathbb{Z}_+ = \{0,1,2,\dots\}$ and $U = \{0,1\}$. The input process is interpreted as the sequence of
times at which the server is busy. It is subject to the constraint $U(k) \in U(X(k))$ for each k, where
$U(x) = \{0,1\}$ for $x \ge 1$, and $U(0) = \{0\}$ (the server can’t work if there is nobody in the queue).
Example 6.2.3 provides a simple example in which A is an i.i.d. Bernoulli process with parameter
$\alpha$. Setting $S(k) = 1 - A(k)$ for each k, it follows that S is also an i.i.d. Bernoulli process with
parameter $\mu = 1 - \alpha$. The MDP in this special case is the controlled M/M/1 queue:


(i) The controlled transition matrix:

$$P_u(x,x') = \begin{cases} \alpha & x' = x+1 \\ \mu & x' = x-u \\ 0 & \text{else} \end{cases} \tag{7.30}$$

(ii) A standard cost function is $c(x,u) = x$ for each x, so that $E[c(X(k), U(k))] = E[X(k)]$ is the mean
queue length at time k.

(iii) A typical objective in queueing network applications is average cost (7.2), which requires
$\rho = \alpha/\mu < 1$.

The M/M/1 queue is obtained with the non-idling policy, $U(k) = \phi^*(X(k))$, with $\phi^*(x) = 1$ for
$x \ge 1$. This is average cost optimal, resulting in $\eta^* = \rho/(1-\rho)$.
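
The value $\eta^* = \rho/(1-\rho)$ can be checked by simulation. The following is a minimal sketch, assuming the Bernoulli model of Example 6.2.3 with $S(k) = 1 - A(k)$. The function name `simulate_crw_queue`, the value $\alpha = 0.3$, the run length, and the seed are illustrative choices rather than anything taken from the text.

```python
import random


def simulate_crw_queue(alpha, n_steps, x0=0, seed=0):
    """Simulate (7.29) under the non-idling policy and return the average queue length.

    Assumes A(k) ~ Bernoulli(alpha) and S(k) = 1 - A(k), as in Example 6.2.3.
    """
    rng = random.Random(seed)
    x = x0
    total = 0
    for _ in range(n_steps):
        total += x                      # cost c(x, u) = x
        u = 1 if x >= 1 else 0          # non-idling: serve whenever the queue is non-empty
        a = 1 if rng.random() < alpha else 0
        s = 1 - a                       # exactly one of (arrival, service) occurs
        x = x - s * u + a
    return total / n_steps


alpha = 0.3                             # illustrative arrival parameter
mu = 1.0 - alpha
rho = alpha / mu
eta_star = rho / (1.0 - rho)            # predicted average cost
eta_hat = simulate_crw_queue(alpha, n_steps=400_000)
print(f"predicted {eta_star:.4f}, simulated {eta_hat:.4f}")
```

The simulated average should be close to the prediction, with agreement improving as the run length grows. Load values $\rho$ near one require much longer runs.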


Fluid models and value functions. To define the fluid model for the CRW queue we begin
with a convex relaxation, taking $X = \mathbb{R}_+$ and $U = [0,1]$. The mean-field dynamics and mean-field
vector field associated with (7.29) are

$$\begin{aligned} \bar F(x,u) &= E[X(k) - S(k+1)U(k) + A(k+1) \mid X(k) = x,\ U(k) = u] \\ &= x - \mu u + \alpha \\ \bar f(x,u) &= -\mu u + \alpha \end{aligned}$$

Shown on the left hand side of Fig. 7.1 is a copy of the left hand side of Fig. 6.1, and on the
right hand side the evolution of the fluid model $\tfrac{d}{dt} q = \bar f(q,u)$, under the non-idling policy $u = 1$
when $q > 0$, and $u = \rho = \alpha/\mu$ otherwise.
Let’s now compare three value functions under the respective non-idling policies, and with cost
function $c(x,u) = x$.

(i) It is easy to verify that a solution to the ACOE (7.5) is given by

$$h^*(x) = \frac{1}{2} \frac{x^2 + x}{\mu - \alpha} \tag{7.31}$$
Figure 7.1: On the left is a sample path X(k) of the M/M/1 queue with $\rho = \alpha/\mu = 0.9$, and X(0) = 400. On the
right is a solution to the fluid model equation $\tfrac{d}{dt} q = (-\mu + \alpha)$ for $q > 0$, starting from the same initial condition.


(ii) The fluid value function in continuous time is easily obtained:

$$J^*(x) = \int_0^\infty q_t \, dt = \int_0^W q_t \, dt = \tfrac{1}{2} W \times H \quad \text{(area of a triangle)},$$

where “H” refers to the height of the triangle defined by the linear path of q, and “W” the
width, which is the time to reach zero. A glance at the plot on the right hand side of Fig. 7.1
should convince you that $H = q_0 = x$ and $W = x/(\mu - \alpha)$, so that

$$J^*(x) = \frac{1}{2} \frac{x^2}{\mu - \alpha}$$

We can compute the function $\bar\eta(x,u)$ appearing in (7.28b):

$$\begin{aligned} \bar\eta(x,u) &= E[J^*(X(k+1)) - J^*(X(k)) \mid X(k) = x,\ U(k) = u] - \nabla J^*(x) \cdot \bar f(x,u) \\ &= \frac{1}{2} \frac{\mu u^2 + \alpha}{\mu - \alpha} \end{aligned}$$

This is bounded, and independent of x. It follows from Lemma 7.3 that $J^*$ almost solves the
ACOE for the MDP model.


(iii) Treatment of the fluid model in discrete time is a bit more complex:


(a) The fluid model becomes $x(k+1) = \bar F(x(k), u(k)) = x(k) - \mu u(k) + \alpha$.
(b) The optimal policy is maximal, subject to the constraint that $x(k) \ge 0$ for each k:
$$\phi^*(x) = \min\{1, (x + \alpha)/\mu\}$$
This results in $x(k+1) = 0$ when $x(k) \le \mu - \alpha$.
(c) The total cost is given by
$$J^*(x) = \begin{cases} x & x \le \mu - \alpha \\ \dfrac{1}{2} \dfrac{x^2}{\mu - \alpha} + \dfrac{1}{2} x & \text{otherwise} \end{cases}$$
This follows by direct computation, or verification that this function solves the DP
equation $J^*(x) = \min_u \{c(x,u) + J^*(\bar F(x,u))\}$, with boundary condition $J^*(0) = 0$.
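
The three value functions in items (i) to (iii) can be checked with exact rational arithmetic. The sketch below is illustrative only: the values $\alpha = 3/10$ and $\mu = 7/10$, and all function names, are assumptions made for the example. The check of the discrete-time formula is restricted to lattice points $x = n(\mu - \alpha)$, where the recursion under $\phi^*$ can be verified exactly.

```python
from fractions import Fraction as Fr

alpha, mu = Fr(3, 10), Fr(7, 10)        # illustrative values with alpha + mu = 1
delta = mu - alpha
eta_star = alpha / delta                # equals rho / (1 - rho)


def h_star(x):
    """Relative value function (7.31)."""
    return Fr(x * x + x) / (2 * delta)


def J_cont(x):
    """Continuous-time fluid value function, x^2 / (2 (mu - alpha))."""
    return Fr(x * x) / (2 * delta)


def J_disc(x):
    """Discrete-time fluid value function from item (iii)(c)."""
    return x if x <= delta else Fr(x * x) / (2 * delta) + x / 2


def step_mean(g, x, u):
    """E[g(X(k+1)) | X(k) = x, U(k) = u] for the controlled M/M/1 queue."""
    return alpha * g(x + 1) + mu * g(x - u)


def acoe_rhs(x):
    """Right hand side of the ACOE, minimizing over the feasible inputs."""
    actions = (0, 1) if x >= 1 else (0,)
    return min(x + step_mean(h_star, x, u) for u in actions)


def eta_bar(x, u):
    """Error term (7.28b) for the continuous-time fluid value function."""
    grad = Fr(x) / delta                # derivative of J_cont at x
    f_bar = -mu * u + alpha
    return step_mean(J_cont, x, u) - J_cont(x) - grad * f_bar


def phi_star(x):
    """Policy from item (iii)(b)."""
    return min(Fr(1), (x + alpha) / mu)


# (i) h* solves the ACOE with average cost eta*.
acoe_ok = all(acoe_rhs(x) == h_star(x) + eta_star for x in range(50))

# (ii) The error term equals (mu u^2 + alpha) / (2 (mu - alpha)), independent of x.
eta_ok = all(
    eta_bar(x, u) == (mu * u * u + alpha) / (2 * delta)
    for x in range(1, 50) for u in (0, 1)
)

# (iii) On the lattice x = n * delta, J_disc satisfies J(x) = x + J(F(x, phi*(x))).
dp_ok = all(
    J_disc(n * delta) == n * delta + J_disc(n * delta - mu * phi_star(n * delta) + alpha)
    for n in range(50)
)
print(acoe_ok, eta_ok, dp_ok)
```

Exact fractions are used so that each identity is tested as an equality rather than up to rounding error.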


The strong solidarity of the value functions in this example is the product of two factors: the
cost function c is Lipschitz continuous, and the MDP is “skip free” in a mean-square sense:

$$\max_{x,u} E[\|X(k+1) - X(k)\|^2 \mid X(k) = x,\ U(k) = u] < \infty$$

The example considered next violates the skip-free property. However, the continuous time fluid
model approximation is accurate because $\nabla^2 J^*$ vanishes as the state tends to infinity.
7.4 Speed Scaling


Dynamic speed scaling refers to algorithms that adjust the processing speed of a computer processor in
response to its environment. While initially proposed for processor design [25], the ideas have impacted
other application areas such as wireless communication.

Dynamic speed scaling is modeled as an MDP in discrete time, using a variation of the CRW 
queueing model. For each k = 0,1,2,..., let A(k) denote the job arrivals in this time slot, X(k) 
the number of jobs in the queue awaiting service, and U(k) the rate of service. The evolution of 
the state is then 

$$X(k+1) = X(k) - U(k) + A(k+1), \qquad k \ge 0 \tag{7.32}$$


It is assumed that the arrival process A is i.i.d., so that this forms the dynamical system for an 
MDP model. 

One source of complexity comes from integer constraints on the state and input: each evolves
on the non-negative integers $X = U = \mathbb{Z}_+ \overset{\text{def}}{=} \{0,1,2,3,\dots\}$, and the constraint $U(k) \le X(k)$ is also
imposed, so that $U(x) = \{u \in \mathbb{Z}_+ : u \le x\}$.

The arrival process also evolves in $\mathbb{Z}_+$. The common mean and variance of $A(k)$ are assumed
finite, with mean denoted $\alpha$. Assume moreover that $P\{A(1) = 0\} > 0$. Under this assumption,
there is a stationary policy under which X becomes a Markov chain that is $x^\bullet$-irreducible with
$x^\bullet = 0$ (see (6.57) and surrounding discussion). A simple example is

$$U(k) = \phi^\circ(X(k)) = X(k),$$

so that $X(k) = A(k)$ for $k \ge 1$.
The cost function we consider balances delay with power consumption: 


$$c(x,u) = x + r P(u),$$

where P denotes the power consumption as a function of the service rate u, and $r > 0$. We are
then led to the ACOE to obtain an optimal policy,

$$\begin{aligned} h^*(x) + \eta^* &= \min_{u \in U(x)} \{ c(x,u) + P_u h^*(x) \} \\ &= \min_{u \in U(x)} \{ x + r P(u) + E[h^*(x - u + A(1))] \} \end{aligned} \tag{7.33}$$

Here we will stick to quadratic cost for u, which is alleged to be a good approximation for computer
processors.


Fluid model. We cannot possibly solve the ACOE for the stochastic model (7.33), or even obtain
any intuition in this discrete-stochastic world. In this example we find that optimization of the
fluid model in continuous time provides a near perfect approximation of the ACOE solution.

We henceforth abandon integer constraints, and focus on the scalar state space model

$$\tfrac{d}{dt} x_t = -u_t + \alpha \tag{7.34}$$

where $\alpha$ is the expectation of $A(k)$, and the processing speed $u_t$ and queue length $x_t$ are each
non-negative. For any cost function $c\colon \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$, the associated value function is denoted

$$J^*(x) = \inf_{u} \int_0^\infty c(x_t, u_t)\, dt, \qquad x_0 = x \in \mathbb{R}_+$$
The HJB equation is then

$$0 = \min_u \{ c(x,u) + (-u + \alpha) \tfrac{d}{dx} J^*(x) \} \tag{7.35}$$


This is similar to the integrator model with polynomial cost considered in Section 3.9.4 (see (3.62)). 

Figure 7.2: Comparison of optimal policies for the fluid and MDP models. 


Computation & Solidarity. Consider the following choices for P(u) in the definition of $J^*$:
Case 1: $P(u) = u^2$. In this case $c(x,u)$ never vanishes, so that the total cost $J^*(x)$ is never finite
valued!

This challenge is resolved by considering instead the (weighted) shortest path problem (SPP):

$$K^*(x) = \min_u \int_0^{T_0} c(x_t, u_t)\, dt, \tag{7.36}$$

where $x_0 = x \in \mathbb{R}_+$, $T_0 = \min\{t : x_t = 0\}$ is the first time that $x_t$ hits the origin, and the minimum
is over u as before. The function $K^*$ is finite-valued and solves the HJB equation (7.35).

We next obtain bounds on $K^*$ by considering simple policies. For example, if we take $u_t = \mu$
whenever $x_t > 0$ then (7.34) turns into the fluid model for the CRW queue. Assuming that $\mu > \alpha$,
this policy is stabilizing, as seen by solving the state equation:

$$x_t = x_0 - (\mu - \alpha) t, \qquad 0 \le t \le T_0,$$

where $T_0 = (\mu - \alpha)^{-1} x_0$. Integrating gives,

$$K(x;\mu) = \int_0^{T_0} (x_t + r u_t^2)\, dt = \frac{1}{2} \frac{x^2}{\mu - \alpha} + r \mu^2 \frac{x}{\mu - \alpha}$$

This implies that $K^*(x)$ can grow no faster than a quadratic.

We can obtain a tighter bound by allowing the input to depend on more information: consider
the control law $u_t = \mu^*(x)$ for $x_t > 0$ with $x_0 = x$, where $\mu^*(x)$ is the minimizer of $K(x;\mu)$ over
$\mu > \alpha$. The first order condition for optimality gives

$$0 = \frac{d}{d\mu} K(x;\mu) \Big|_{\mu = \mu^*} = -\frac{\tfrac{1}{2} x^2 + r x [\mu^*]^2}{(\mu^* - \alpha)^2} + \frac{2 r x \mu^*}{\mu^* - \alpha}$$

where $\mu^* = \mu^*(x)$ in each appearance. This becomes a quadratic equation in $\mu^*$ after multiplying
both sides by $(\mu^* - \alpha)^2$, and can be solved to obtain

$$\mu^*(x) = \alpha + \sqrt{\alpha^2 + \tfrac{1}{2} x / r}, \qquad K(x;\mu^*(x)) = \frac{\tfrac{1}{2} x^2 + r x [\mu^*(x)]^2}{\mu^*(x) - \alpha}$$

We have $K^*(x) \le K(x;\mu^*(x))$, which implies that the growth rate of $K^*$ is no faster than $x^{3/2}$.

This bound is relatively tight, as we now see by solving the HJB equation (7.35): for $x > 0$,

$$0 = \min_u \big( x + \tfrac{1}{2} u^2 + (-u + \alpha) \tfrac{d}{dx} K^*(x) \big) \tag{7.37}$$

This is a first order ODE with boundary condition $K^*(0) = 0$. Its solution with $r = \tfrac{1}{2}$ is

$$K^*(x) = \alpha x + \tfrac{1}{3} \big[ (2x + \alpha^2)^{3/2} - \alpha^3 \big] \tag{7.38}$$

Assuming that $u^* \ne 0$ in (7.37), the first-order optimality conditions give the optimal policy:

$$0 = \frac{\partial}{\partial u} \big( x + \tfrac{1}{2} u^2 + (-u + \alpha) \tfrac{d}{dx} K^*(x) \big)$$

so that $u^* = \phi^{F*}(x) = \tfrac{d}{dx} K^*(x)$. This is indeed the optimal policy for $x > 0$, as the derivative is non-negative
for all x.

Case 2: $P(u) = (u - \alpha)^2$. This modification of the cost function leads to a far simpler expression
for the SPP, which coincides with total cost: we now have $c(x^e, u^e) = 0$, where $x^e = 0$ and $u^e = \alpha$.
For simplicity we maintain $r = \tfrac{1}{2}$. The HJB equation (7.35) is similar to (7.37):

$$0 = \min_u \big( x + \tfrac{1}{2}(u - \alpha)^2 + (-u + \alpha) \tfrac{d}{dx} J^*(x) \big) \tag{7.39}$$

Assuming that $u^* \ne 0$, we apply the first-order optimality conditions as before to obtain

$$0 = \frac{\partial}{\partial u} \big( x + \tfrac{1}{2}(u - \alpha)^2 + (-u + \alpha) \tfrac{d}{dx} J^*(x) \big) \implies u^* = \alpha + \tfrac{d}{dx} J^*(x)$$

The closed loop dynamics can be expressed as gradient descent:

$$\tfrac{d}{dt} x_t^* = -u_t^* + \alpha = -\tfrac{d}{dx} J^*(x_t^*)$$

Computation of $J^*$ proceeds as before: solving (7.39), subject to the boundary condition $J^*(0) = 0$.
Let’s guess the solution as $J^*(x) = b p^{-1} x^p$ for some $p > 1$ and $b > 0$. We obtain the following
conclusions:


(i) The minimizer in the HJB equation (7.39) gives the policy for the fluid model:
$$\phi^{F*}(x) = \alpha + \tfrac{d}{dx} J^*(x) = \alpha + b x^{p-1}$$
(ii) Substitution of $u^* = \phi^{F*}(x)$ into (7.39) gives,
$$0 = x + \tfrac{1}{2} \big[\tfrac{d}{dx} J^*(x)\big]^2 - \big[\tfrac{d}{dx} J^*(x)\big]^2 = x - \tfrac{1}{2} b^2 x^{2(p-1)}$$
This is only possible for $b = \sqrt{2}$ and $p = 3/2$ (consistent with (7.38)).
(iii) For general $r > 0$, the value function and optimal policy for the fluid model are given by
$$J^*(x) = \tfrac{4}{3} r^{1/2} x^{3/2} \quad \text{and} \quad \phi^{F*}(x) = \alpha + \sqrt{x/r}. \tag{7.40}$$


Note that $x$ and $[\phi^{F*}(x) - \alpha]^2$ are each linear in x. This reflects the balance between state
and control cost.
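
The closed-form solutions in Cases 1 and 2 can be tested numerically by substituting them into the bracketed terms of the HJB equations. The sketch below is a sanity check under stated assumptions: the function names, the value $\alpha = 1$, and the test grids are hypothetical choices made for illustration. It evaluates the residual at the claimed minimizer, where it should vanish, and on a grid of other inputs, where it should be non-negative.

```python
import math

alpha = 1.0                                   # illustrative mean arrival rate


def phi_fluid(x, r):
    """Fluid policy from (7.40): alpha + sqrt(x / r)."""
    return alpha + math.sqrt(x / r)


def residual_case2(x, u, r):
    """Bracketed term of (7.39) for general r, with J* taken from (7.40)."""
    dJ = 2.0 * math.sqrt(r * x)               # derivative of (4/3) r^{1/2} x^{3/2}
    return x + r * (u - alpha) ** 2 + (-u + alpha) * dJ


def dK_star(x):
    """Derivative of K* in (7.38), which assumes r = 1/2."""
    return alpha + math.sqrt(2.0 * x + alpha ** 2)


def residual_case1(x, u):
    """Bracketed term of (7.37), with r = 1/2."""
    return x + 0.5 * u ** 2 + (-u + alpha) * dK_star(x)


xs = [0.5, 1.0, 4.0, 25.0]
us = [0.1 * j for j in range(200)]            # grid of candidate inputs

# Residual at the claimed minimizer: should be zero up to rounding.
worst_case2 = max(abs(residual_case2(x, phi_fluid(x, r), r)) for x in xs for r in (0.5, 2.0))
worst_case1 = max(abs(residual_case1(x, dK_star(x))) for x in xs)

# Residual at other inputs: should never fall below zero.
min_case2 = min(residual_case2(x, u, r) for x in xs for u in us for r in (0.5, 2.0))
min_case1 = min(residual_case1(x, u) for x in xs for u in us)
print(worst_case1, worst_case2, min_case1, min_case2)
```

For Case 2 the residual equals $(\sqrt{x} - \sqrt{r}\,(u - \alpha))^2$, which makes the non-negativity visible without any computation.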
Figure 7.3: The convergence of value iteration for the quadratic cost function.


Solidarity. A numerical example illustrates the strong solidarity between the MDP model and
its fluid approximation.

Take $r = \tfrac{1}{2}$, and choose the marginal distribution of the arrival process to be a scaled geometric
distribution:


$$A(k) = \Delta_A G(k), \qquad k \ge 1, \tag{7.41}$$

where $\Delta_A > 0$ and G is geometrically distributed on $\{0,1,\dots\}$ with parameter $p_A$. The mean and
variance of $A(k)$ are given by, respectively,

$$m_A = \frac{p_A}{1 - p_A} \Delta_A, \qquad \sigma_A^2 = \frac{p_A}{(1 - p_A)^2} \Delta_A^2, \tag{7.42}$$

with $p_A = 0.96$ and $\Delta_A = 1/24$ chosen so that the mean $m_A$ is equal to unity:

$$1 = m_A = \Delta_A \frac{p_A}{1 - p_A} \tag{7.43}$$


You can read about the value iteration algorithm (VIA) in Appendix B.2, which is nearly
identical to the definition for deterministic control systems seen in (3.8). In particular, the algorithm
generates a sequence of functions $\{V_n\}$, each of which can be interpreted as a value function for a
finite-horizon optimal control problem:

$$V_n(x) = \min E\Big[ \sum_{k=0}^{n-1} c(X(k), U(k)) + V_0(X(n)) \,\Big|\, X(0) = x \Big] \tag{7.44}$$


See (3.9) for the deterministic counterpart. 

Shown in Fig. 7.2 is a comparison of the optimal policy, computed numerically using value
iteration, and the optimal policy for the fluid model. Recall that the constraint $U(k) \le X(k)$ is
imposed in the stochastic model. This constraint is imposed in the policy shown in the figure, so
that

$$\phi^{F}(x) = \min(x, \phi^{F*}(x))$$

For MDP models the differences $\{h_n(x) = V_n(x) - V_n(x^\bullet)\}$ are approximations of
the relative value function. The convergence of $\{h_n\}$ to $h^*$ is illustrated in Fig. 7.3 using $x^\bullet = 0$.
The error $\|h_{n+1} - h_n\|$ is much smaller for initial n when the algorithm is initialized using the fluid
value function.
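
The value iteration experiment can be sketched in a few lines. What follows is a simplified illustration, not the computation behind Figs. 7.2 and 7.3: it assumes integer-valued geometric arrivals with parameter 0.5 (rather than the scaled distribution (7.41)), the cost $c(x,u) = x + r u^2$ with $r = 1/2$, and a state space truncated at a hypothetical level `x_max`. All function and variable names are introduced here for the example.

```python
import math


def geometric_pmf(p, a_max):
    """P{G = n} = (1 - p) p^n for n < a_max, with the remaining tail mass placed at a_max."""
    pmf = [(1.0 - p) * p ** n for n in range(a_max)]
    pmf.append(1.0 - sum(pmf))
    return pmf


def relative_value_iteration(pmf, r=0.5, x_max=40, n_iter=50, v0=None):
    """Value iteration for c(x,u) = x + r u^2, normalized so that h_n(0) = 0."""
    v = list(v0) if v0 is not None else [0.0] * (x_max + 1)
    policy = [0] * (x_max + 1)
    for _ in range(n_iter):
        v_new = [0.0] * (x_max + 1)
        for x in range(x_max + 1):
            best, best_u = math.inf, 0
            for u in range(x + 1):                      # constraint U(k) <= X(k)
                # Next state x - u + A, truncated at x_max (an approximation).
                exp_next = sum(q * v[min(x - u + a, x_max)] for a, q in enumerate(pmf))
                val = x + r * u * u + exp_next
                if val < best:
                    best, best_u = val, u
            v_new[x], policy[x] = best, best_u
        v = [val - v_new[0] for val in v_new]           # h_n(x) = V_n(x) - V_n(0)
    return v, policy


pmf = geometric_pmf(p=0.5, a_max=10)
alpha = sum(a * q for a, q in enumerate(pmf))           # mean arrival rate

# Fluid initialization: K* from (7.38), which assumes r = 1/2.
k_star = [alpha * x + ((2 * x + alpha ** 2) ** 1.5 - alpha ** 3) / 3.0 for x in range(41)]

h_zero, policy_zero = relative_value_iteration(pmf)             # initialized at zero
h_fluid, policy_fluid = relative_value_iteration(pmf, v0=k_star)  # initialized at K*
print(policy_zero[:10], policy_fluid[:10])
```

Subtracting $V_n(0)$ at each step does not change the minimizing inputs, since adding a constant to the value function shifts every candidate by the same amount. Truncation distorts values near `x_max`, so only states well below that level should be compared with the fluid policy.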
7.5 LQG


The discrete-time linear system equations are defined by the linear state space model of Section 2.3.3, with the introduction of disturbance process N and observation noise W:

$$\begin{aligned} X(k+1) &= F X(k) + G U(k) + N(k+1) \\ Y(k) &= H X(k) + W(k) \end{aligned} \tag{7.45}$$

The joint process $\{N(k), W(k) : k \ge 0\}$ is assumed to be independent and identically distributed.
The process N represents a system disturbance, in the sense that it is uncontrolled, and impacts the
state (whose entries represent physical quantities). The variable $W(k)$ corrupts our measurement of
$H X(k)$ at time k. It is assumed that each is a zero mean process, with finite covariances denoted
$\Sigma_N$, $\Sigma_W$, respectively.


Fluid model dynamics. Under the assumption that the disturbance has zero mean, the fluid
model in discrete time is the linear state space model introduced in (2.13a):

$$x(k+1) = F x(k) + G u(k), \qquad k \ge 0$$

Before turning to control comparisons, let’s compare the dynamics of the two models without
control:

$$x(k+1) = F x(k) \quad \text{and} \quad X(k+1) = F X(k) + N(k+1), \qquad k \ge 0 \tag{7.46}$$


Consider the special case used as an example in Section 2.4.6, with 


—0.2, 1 
leas er 7 


The fluid model in continuous time is defined by 


$$\tfrac{d}{dt} x = \bar f(x) = 0.02 A x \qquad \text{(in this example, with } u = 0\text{)}.$$


This is entirely consistent with the construction of F (which was based on an Euler approximation).

We can now explain the nature of the disturbance process used to obtain the plots shown in 
Fig. 2.7: N was chosen i.i.d. and Gaussian, with rank-deficient covariance: N(k) = gV(k) with 
V(k) scalar N(0,1) and g = 2.5(7). Trajectories of the two state processes are shown in Fig. 2.7. 
A careful look at the figure on the right reveals the degeneracy of the noise, which influences the 
state directly in directions +g. 


DP equations. The linear dynamics in (7.45) are the explanation for the “L” in LQG, and the
“Q” appears because we adopt the quadratic cost appearing in the deterministic LQR framework:

$$c(x,u) = x^\top S x + u^\top R u \tag{7.47}$$

with $S \ge 0$ and $R > 0$. We postpone explanation of the “G” because it is irrelevant until we explain
why we care about the measurements Y.

The infinite-horizon objective is typically infinite, so let’s consider first the finite-horizon problem

$$J(x) = E\Big[ X(\mathcal{N})^\top R_0 X(\mathcal{N}) + \sum_{k=0}^{\mathcal{N}-1} \big( X(k)^\top S X(k) + U(k)^\top R U(k) \big) \,\Big|\, X(0) = x \Big] \tag{7.48}$$

with $R_0 \ge 0$. The minimizer of J is easily obtained as linear state feedback:
Proposition 7.4. (i) The minimization of (7.48) over all possibly non-linear policies is
obtained using linear state feedback $U(k) = -K_k X(k)$, in which the feedback gain matrix is

$$K_k = [R + G^\top M_{k+1} G]^{-1} G^\top M_{k+1} F$$

where $\{M_k\}$ is determined by the matrix Riccati difference equation, that runs backward in
time:

$$M_k = F^\top \big( M_{k+1} - M_{k+1} G (G^\top M_{k+1} G + R)^{-1} G^\top M_{k+1} \big) F + S,$$

with terminal condition $M_{\mathcal{N}} = R_0$.
(ii) The value function is quadratic: $J^*(x) = J^*(0) + x^\top M_0 x$.


(iii) The following limits exist: 


The pair solve the ACOE (7.5), for which $\eta^*$ is the optimal average cost. The average-cost
optimal policy (7.6) is linear state feedback:

$$\phi^*(x) = -K^* x \quad \text{with} \quad K^* = [R + G^\top M^* G]^{-1} G^\top M^* F$$
where $M^*$ is the solution to the ARE (algebraic Riccati equation) (3.40):
$$M^* = F^\top \big( M^* - M^* G (R + G^\top M^* G)^{-1} G^\top M^* \big) F + S$$
□
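
The backward Riccati recursion of Prop. 7.4 is easy to run in the scalar case. The sketch below is illustrative: the function name `riccati_scalar` and the numerical values of F, G, S, R are assumptions chosen for the example. It iterates the recursion from the terminal condition and then checks that the result is a fixed point of the ARE.

```python
def riccati_scalar(F, G, S, R, R0, horizon):
    """Scalar Riccati difference equation, run backward from M_N = R0.

    Returns M_0 and the list of gains [K_0, ..., K_{N-1}].
    """
    M = R0                                   # terminal condition
    gains = []
    for _ in range(horizon):
        K = G * M * F / (R + G * M * G)      # K_k is computed from M_{k+1}
        M = F * (M - M * G * G * M / (G * M * G + R)) * F + S
        gains.append(K)
    gains.reverse()                          # stored as K_{N-1}, ..., K_0 during the loop
    return M, gains


# Illustrative scalar system: open-loop unstable since |F| > 1.
F, G, S, R = 1.1, 1.0, 1.0, 1.0
M0, gains = riccati_scalar(F, G, S, R, R0=0.0, horizon=200)

are_residual = M0 - (F * (M0 - M0 * G * G * M0 / (G * M0 * G + R)) * F + S)
closed_loop = F - G * gains[0]               # closed-loop multiplier under U = -K_0 X
print(M0, gains[0], are_residual, closed_loop)
```

Note that the disturbance covariance never enters the computation, in line with the observation that the optimal gain does not depend on the distribution of N.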


A remarkable take-away from this result is that the solutions to the finite-horizon and average-cost control problems do not depend in any way on the distribution of the disturbance, beyond
the assumption that it is zero mean with finite variance, and the standing i.i.d. assumptions
(even the independence assumption can be relaxed, subject to a restricted definition of optimality
[81, 7]). In addition, the optimal policy for the average cost optimal control problem coincides with
the optimal policy for the associated fluid model (in discrete time) with total-cost criterion: recall
(3.39).

The only reason to impose assumptions on the distributions is when we come to the next topic. 


Partial Observations. If only Y is observed then we may attempt to minimize J over all functions of these observations. That is, restrict inputs to functions of current and past observations:

$$U(k) = \phi_k(Y(0), \dots, Y(k)), \qquad k \ge 0 \tag{7.49}$$


An explicit solution can be obtained under the assumption that (N,W) is jointly Gaussian: 


Proposition 7.5. Suppose that (N,W) is jointly Gaussian, and independent of X(0). Then,


(i) The solution to the finite-horizon optimal control problem, in which J is minimized over
all input sequences of the form (7.49), is obtained using

$$U(k) = -K_k \hat X(k)$$

The gain $K_k$ is the same matrix sequence introduced in Prop. 7.4.
The estimate $\hat X(k)$ is defined recursively as follows, based on this important fact: the
conditional distribution of $X(k)$ given $Y_0^k$ is Gaussian $N(m_k, \Sigma_k)$ in which
(a) The conditional mean $m_k = \hat X(k)$ evolves as the time-varying linear system,
$$\begin{aligned} \hat X(k+1) &= F \hat X(k) + G U(k) + L_{k+1} [Y(k+1) - \hat Y(k+1 \mid k)] \\ \hat Y(k+1 \mid k) &= H \hat X(k+1 \mid k) = H [F \hat X(k) + G U(k)], \qquad k \ge 0 \end{aligned}$$
with initial condition
$$\hat X(0) = E[X(0)] + L_0 [Y(0) - H E[X(0)]]$$
(b) The filter gains are defined by
$$L_k = \Sigma_{k \mid k-1} H^\top [\Sigma_W + H \Sigma_{k \mid k-1} H^\top]^{-1}, \qquad k \ge 0$$
(c) The conditional covariance does not depend on the observations, but evolves according
to a deterministic Riccati recursion:
$$\begin{aligned} \Sigma_{k+1 \mid k} &= F \Sigma_k F^\top + \Sigma_N \\ \Sigma_{k+1} &= \Sigma_{k+1 \mid k} - \Sigma_{k+1 \mid k} H^\top [H \Sigma_{k+1 \mid k} H^\top + \Sigma_W]^{-1} H \Sigma_{k+1 \mid k} \end{aligned} \tag{7.50}$$
with initial condition
$$\Sigma_{0 \mid -1} = E\big[ (X(0) - E[X(0)]) (X(0) - E[X(0)])^\top \big]$$


(ii) The solution to the average-cost optimal control problem, subject to the constraint (7.49)
on the input sequence, is obtained using static linear feedback

$$U(k) = -K^* \hat X(k)$$

where $\hat X(k)$ is the conditional mean, as in (i). □
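
Part (c) of Prop. 7.5 says the conditional covariance can be computed offline, before any observations arrive. A scalar sketch of the recursion (7.50) is given below. The function name `kalman_covariances` and the numerical values are hypothetical, chosen only to illustrate the recursion.

```python
def kalman_covariances(F, H, var_n, var_w, sigma_prior, n_steps):
    """Scalar version of the filter gain and the Riccati recursion (7.50).

    Returns a list of (L_k, Sigma_{k|k-1}, Sigma_k) for k = 0, ..., n_steps - 1.
    """
    pred = sigma_prior                                    # Sigma_{0|-1}
    history = []
    for _ in range(n_steps):
        gain = pred * H / (var_w + H * pred * H)          # filter gain L_k
        post = pred - pred * H * H * pred / (H * pred * H + var_w)   # Sigma_k
        history.append((gain, pred, post))
        pred = F * post * F + var_n                       # Sigma_{k+1|k}
    return history


# Illustrative values: stable dynamics, unit noise variances, vague prior.
history = kalman_covariances(F=0.9, H=1.0, var_n=1.0, var_w=1.0,
                             sigma_prior=10.0, n_steps=100)
final_gain, final_pred, final_post = history[-1]
print(final_gain, final_pred, final_post)
```

No observation Y(k) appears anywhere in the function, which is the content of the statement that the covariance evolves deterministically. Each observation reduces uncertainty, so the updated variance never exceeds the predicted variance.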


One step in the proof of the proposition is that the cost can be expressed in terms of the new
state $\hat X$ as follows:

$$E[X(k)^\top S X(k) + U(k)^\top R U(k)] = E[\hat X(k)^\top S \hat X(k) + U(k)^\top R U(k)] + E[\tilde X(k)^\top S \tilde X(k)]$$

in which the tilde denotes error: $\tilde X(k) = X(k) - \hat X(k)$. The last term is independent of control:

$$E[\tilde X(k)^\top S \tilde X(k)] = \operatorname{trace}(\Sigma_k S)$$

This leads to the separation principle: The optimal input is the naive “plug in” control law:

$$U(k) = \phi^*(\hat X(k)) = -K^* \hat X(k) \tag{7.51}$$

where $K^*$ is the feedback gain obtained for the control problem with full observations.

This optimal control solution is said to be of certainty-equivalent form. Please don’t forget that
the certainty-equivalence conclusion is very particular to the assumptions imposed. It may be
tempting to use the policy $U(k) = \phi^*(\hat X(k))$ in more general settings when the state is only
partially observed. In general the optimal feedback law requires far more information regarding
the conditional distribution of the state process. Appendix C contains a derivation of the optimal
policy for more general partially observed control systems.


The next examples are intended to explore the role of partial information, and the creation of 
a “belief state”. 
7.6 A Queueing Game


We return now to the two-station network model with four dimensional queue-length processes 
introduced in Section 3.9.3. 

An MDP model can be obtained by analogy with the fluid model described there. We might 
opt for an extension of the CRW model for the single queue (7.29), in which the four buffers evolve 
in discrete time as follows: 


$$\begin{aligned} X_1(k+1) &= X_1(k) - S_1(k+1) U_1(k) + A_1(k+1) \\ X_2(k+1) &= X_2(k) - S_2(k+1) U_2(k) + S_1(k+1) U_1(k) \\ X_3(k+1) &= X_3(k) - S_3(k+1) U_3(k) + A_3(k+1) \\ X_4(k+1) &= X_4(k) - S_4(k+1) U_4(k) + S_3(k+1) U_3(k) \end{aligned}$$

in which each $S_i(k)$ is Bernoulli with parameter $\mu_i$, and each $A_i(k)$ has mean $\alpha_i$ and finite variance.
To be a controlled Markov chain it is necessary to assume that the six dimensional stochastic
process $(S, A)$ is i.i.d. The input process is subject to the same constraints as the fluid model, but
in addition it is assumed that $U_i(k)$ takes on values zero or one for each i and k. Solidarity between
the fluid model and this MDP model is investigated in [159], where you can find numerical results
similar to what was obtained for the single queue.

We can now explain the simulation shown on the left hand side of Fig. 3.6: the random variables
$S_i(k)$, $A_i(k)$ are each Bernoulli, and highly dependent in the sense that only one is non-zero at
each time k.

Let’s consider a problem we cannot solve using the theory in this book: restrict each input to 
be a function of only local information. This is a game, because the decision at a station is based 
on only the history of buffer levels at that station. We consider a cooperative setting, in which the 
two players (the stations) have a common goal of minimizing the long run average cost. 

We can simplify the problem by imposing the following symmetry constraint:
denote $X^{I}(k) = (X_1(k), X_4(k))$, $X^{II}(k) = (X_3(k), X_2(k))$, and assume the input is of the form

$$\begin{aligned} U_1(k) &= \phi_1(X^{I}(k)), & U_4(k) &= \phi_4(X^{I}(k)) \\ U_3(k) &= \phi_1(X^{II}(k)), & U_2(k) &= \phi_4(X^{II}(k)) \end{aligned}$$

Our goal is to obtain $\phi^* = (\phi_1^*, \phi_4^*)$ that minimizes the average cost, with $c(x,u) = \sum_i x_i$. The
information structure may be motivated by the desire to reduce communication cost, which is 
definitely the case in applications to global supply chains. In this case the network models a 
single business, with geographically separated manufacturing. Non-cooperative games can also be 
addressed using the concepts introduced here. 

This is a crude approach. In particular, why not allow $U_1(k)$ and $U_4(k)$ to depend on
the history of local observations, $\{X^{I}(i) : i \le k\}$? Appendix C contains hints at how one might
construct a sufficient statistic for control.

The information constraints in this example destroy any justification for the application of the 
dynamic programming equations of MDP theory, but we may still apply MDP algorithms and see 
if we get lucky. In this simple example we do get lucky. 

Here is an approach that leads to a successful control solution in this example: The goal is to
obtain a sequence of approximating MDP models with state process $X^{I}$ and input process $U^{I}$, with
$U^{I}(k) = (U_1(k), U_4(k))$. We simultaneously obtain an identical sequence of MDP models with state
process $X^{II}$ and input $U^{II}(k) = (U_3(k), U_2(k))$.
Figure 7.4: Distributed control in a two station queueing network. [Two panels: “Initial Policy” and “Final Policy”; region labels $P\{U_1(t) = 1\} = 0.85$ and $P\{U_1(t) = 1\} = 0.15$.]


The procedure requires initialization: choose a randomized policy $\check\phi^0 = (\check\phi^0_1, \check\phi^0_4)$ so that the
resulting Markov chain X is positive recurrent. Then follow these steps for m = 0 to M, with
$M \ge 1$ fixed, or determined by some stopping criterion:


(i) System identification: Simulate the Markov chain using policy $\check\phi^m$ to obtain an estimate
$\varpi^m(z, v, z')$ of the steady-state probability that $(Z(k), V(k), Z(k+1)) = (z, v, z')$, with $Z(k) = X^{I}(k)$ and $V(k) = (U_1(k), U_4(k))$,
for each $z = (z_1, z_2)$, $z' = (z_1', z_2') \in \mathbb{Z}_+^2$, and $v \in \{v^0, v^1, v^2\}$. Based on this,
define a controlled transition matrix:

$$P_v^m(z, z') = \frac{1}{\varpi^m(z, v)} \varpi^m(z, v, z'), \qquad \varpi^m(z, v) = \sum_{z'} \varpi^m(z, v, z') \tag{7.52}$$


(ii) Cost identification: Include in your simulation an estimate $c^m(z, v)$ of the conditional mean cost; see (7.53).


(iii) Policy update: Solve the ACOE with controlled transition matrix $P^m$ and cost function
$c^m$ to obtain a policy $\phi^{m+1}$.


Define $\check\phi^{m+1}$ to be a randomized policy approximating $\phi^{m+1}$, and go to step (i), with m
incremented to m + 1. □
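
The identification steps (i) and (ii) amount to counting. The following is a minimal sketch of how the estimates might be formed from simulation data. The function names, the representation of a local state as a tuple, the string labels for inputs, and the tiny sample lists are all hypothetical and serve only to show the bookkeeping.

```python
from collections import Counter, defaultdict


def estimate_controlled_transition(samples):
    """Empirical version of (7.52) from observed triples (z, v, z_next)."""
    joint = Counter(samples)                       # counts of (z, v, z')
    marginal = Counter()
    for (z, v, _z_next), count in joint.items():
        marginal[(z, v)] += count                  # counts of (z, v)
    p_hat = defaultdict(dict)
    for (z, v, z_next), count in joint.items():
        p_hat[(z, v)][z_next] = count / marginal[(z, v)]
    return dict(p_hat)


def estimate_cost(cost_samples):
    """Empirical conditional mean of the cost given (z, v), as in (7.53)."""
    totals, counts = defaultdict(float), Counter()
    for z, v, cost in cost_samples:
        totals[(z, v)] += cost
        counts[(z, v)] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hand-made data: local state z, input label v, next local state z'.
samples = [
    ((1, 0), "v1", (0, 0)),
    ((1, 0), "v1", (0, 0)),
    ((1, 0), "v1", (1, 0)),
    ((0, 1), "v2", (0, 0)),
]
# Hand-made data: local state z, input label v, total cost observed at that time.
cost_samples = [((1, 0), "v1", 3.0), ((1, 0), "v1", 5.0), ((0, 1), "v2", 2.0)]

P_hat = estimate_controlled_transition(samples)
c_hat = estimate_cost(cost_samples)
print(P_hat[((1, 0), "v1")], c_hat[((1, 0), "v1")])
```

Pairs $(z, v)$ that are never visited receive no estimate, which is one reason the procedure calls for a randomized policy that explores every input.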
The system identification step is motivated by Bayes’ rule. Suppose that $Z(k) = X^{I}(k)$ were
indeed an MDP model with input $V(k) = (U_1(k), U_4(k))$. Its controlled transition matrix is then
defined by the ratio of probabilities:

$$\begin{aligned} P_v(z, z') &= P\{Z(k+1) = z' \mid Z(k) = z,\ V(k) = v\} \\ &= \frac{1}{P\{Z(k) = z,\ V(k) = v\}} P\{Z(k+1) = z',\ Z(k) = z,\ V(k) = v\} \end{aligned}$$

The cost identification step has similar motivation. For large n we have the approximation

$$c^{(m)}(z, v) \approx E[c(X(k)) \mid Z(k) = z,\ V(k) = v] \tag{7.53}$$

so that the steady state mean of $c^{(m)}(Z(k), V(k))$ approximates the steady-state average cost.
Shown on the right in Fig. 7.4 is an illustration of a policy obtained using this approach. The
initial policy was chosen to be a perturbation of the “serve the longest queue” policy:

$$\check\phi^0(v^1 \mid z) = \begin{cases} 0.85 & \text{if } z_1 \ge z_2 \\ 0.15 & \text{else} \end{cases}$$

with $\check\phi^0(v^2 \mid z) = 1 - \check\phi^0(v^1 \mid z)$ (provided $z_2 \ge 1$).

The final policy shown on the right in the figure is more similar to a threshold policy of the
form “Serve buffer 4 whenever $X_1 \le \bar x_1$, and $X_4 \ge 1$”, where $\bar x_1 \approx 10$. The resulting performance
is close to that of the centralized optimal solution [255].

See the cover of this book for a clearer view of the policy, which shows the plot on the right in
Fig. 7.4 before it was converted to grayscale. The color indicates the probability that $U_1(k)$ is equal
to one (which is one minus the probability that $U_4(k)$ is equal to one, provided $X_1(k) + X_4(k) \ge 1$).
The dark blue indicates a value of approximately 0.1, and dark red approximately 0.9.


7.7 Controlling Rover with Partial Information 


We now turn to a setting in which information is limited, but the theory and algorithms in this 
book are directly applicable. The ideas are illustrated with an extremely simple example. 

In past stochastic control course offerings at the University of Florida, the following question
is asked: How to control a rover on Planet Pluto? Planetary exploration is frequently on our
minds: the Kennedy Space Center is only about 150 miles away from Gainesville, Florida.

The setting: an autonomous rover on Pluto depends on its solar panels for energy, but there
is not a lot of energy.^16 Its goal is to collect as much energy as possible, but it is clumsy. It
is close to a small hill and collects the most energy when it is at the top of the hill, but it has a
tendency to roll off the hill, and it then takes energy to get back up.

The model: the rover can be in one of three states: top of the hill, rolling down the hill, or at
the bottom of the hill: $X = \{T, B, R\}$. The input space is $U = \{D, \bar D\}$, corresponding to the actions
“drive” or “don’t drive”.

The state dynamics are summarized in the following three cases, distinguished by location:

1. Rover on top. If driving, then it is still at the top of the hill in the next period with probability
0.8, and rolling down in the next period with probability 0.2. If not driving, these probabilities are
0.75 and 0.25, respectively.
2. Rolling. If driving, then with probability 0.9 it is at the top of the hill in the next period, and
with probability 0.1 it moves to the bottom.
If not driving, then with probability 1 it is at the bottom in the next period.
3. Bottom. If driving, then it remains at the bottom with probability 0.9, and is rolling in the next
period with probability 0.1. If not driving, it remains on the bottom with probability 1.

^16 https://science.nasa.gov/science-news/science-at-nasa/2002/08jan_sunshine
Cost: Rover wants to maximize energy. Energy collected (reward) depends on position and
action:

    Action \ Position    top    rolling    bottom
    driving               1        0          0
    not driving           3        0          0

A realistic model would include energy storage. The example is simplistic for many obvious reasons
beyond the lack of storage, but it is a good example to illustrate a few basic ideas.
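
The fully observed rover model is small enough to write down in a few lines. The following is a minimal sketch, not part of the original text: the state ordering (T, R, B), the action labels "drive" and "no_drive", and the function name average_reward are all assumptions made for illustration. It encodes the transition probabilities and rewards listed above, and evaluates the steady-state average reward of a stationary policy.

```python
import numpy as np

STATES = ["T", "R", "B"]            # top, rolling, bottom (ordering is our choice)
ACTIONS = ["drive", "no_drive"]     # hypothetical labels for the two inputs

# P[a][i, j] = probability of moving from state i to state j under action a
P = {
    "drive": np.array([[0.8, 0.2, 0.0],
                       [0.9, 0.0, 0.1],
                       [0.0, 0.1, 0.9]]),
    "no_drive": np.array([[0.75, 0.25, 0.0],
                          [0.0, 0.0, 1.0],
                          [0.0, 0.0, 1.0]]),
}

# REWARD[a][i] = energy collected in state i under action a
REWARD = {
    "drive": np.array([1.0, 0.0, 0.0]),
    "no_drive": np.array([3.0, 0.0, 0.0]),
}


def average_reward(policy):
    """Average reward and invariant pmf for a stationary policy (dict: state -> action).

    Assumes the resulting Markov chain has a unique invariant pmf.
    """
    P_pi = np.array([P[policy[s]][i] for i, s in enumerate(STATES)])
    r_pi = np.array([REWARD[policy[s]][i] for i, s in enumerate(STATES)])
    # Solve pi P = pi together with sum(pi) = 1 in the least-squares sense
    A = np.vstack([P_pi.T - np.eye(len(STATES)), np.ones(len(STATES))])
    rhs = np.append(np.zeros(len(STATES)), 1.0)
    pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return float(pi @ r_pi), pi


if __name__ == "__main__":
    rest_on_top = {"T": "no_drive", "R": "drive", "B": "drive"}
    always_drive = {"T": "drive", "R": "drive", "B": "drive"}
    print(average_reward(rest_on_top)[0])
    print(average_reward(always_drive)[0])
```

For the policy that rests only at the top of the hill the invariant pmf places mass 9/14 on T, giving average reward 27/14; the always-drive policy gives 9/13.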


Control with partial observations. The input U may depend only on noisy observations of the state
process X. This is modeled by an observation sequence Y, which takes on values zero or one. An
observation can only tell us if we are at the top of the hill: Y(k) = 0 if X(k) = R or X(k) = B.
If X(k) = T, then Y(k) = 1 if $U(k-1) = D$, and Y(k) = 1 with probability $\theta \in (0,1)$ if $U(k-1) = \bar{D}$ (not driving).
In other words,

\[ Y(k) = 1\{X(k) = T,\ U(k-1) = D\} + \Gamma(k)\, 1\{X(k) = T,\ U(k-1) = \bar{D}\}, \qquad (7.54) \]

where $\{\Gamma(k)\}$ is an i.i.d. Bernoulli sequence with parameter $\theta$.
This is known as a POMDP model (partially observed MDP). It is a remarkable fact that an
optimal control solution is obtained by state feedback, provided we modify the definition of “state”.
In the POMDP literature we introduce the belief state $\mathcal{X}(k)$, which
is nothing more than the conditional pmf for the state X(k), given
observations up to time k. In the example considered here, the belief
state evolves on the simplex S (a two dimensional region in $\mathbb{R}^3_+$), with
the interpretation

\[ \mathcal{X}_x(k) = P\{X(k) = x \mid Y(0), \dots, Y(k)\}, \qquad x \in X = \{T, B, R\}. \]

For the average- or discounted-cost criterion, the optimal policy can
be represented as state feedback:

\[ U^*(k) = \phi^*(\mathcal{X}(k)) \]

where $\phi^* : S \to U$. The LQG solution described in Prop. 7.5 is an example of this construction in
which the belief state can be summarized by the conditional mean and conditional covariance.

Figure 7.5: A candidate policy for Rover with partial observations.

A brief survey of POMDP theory is found in Appendix C, where you can find a recursive update
formula for the belief state:

\[ \mathcal{X}(k+1) = M(\mathcal{X}(k), Y(k+1), U(k)), \qquad k \ge 0 \]

with the details of the update map M provided in Prop. C.3.
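
To make the recursion concrete, here is a minimal sketch of a belief-state update for the rover. It is not the statement of Prop. C.3: it assumes the standard predict-then-correct Bayes filter, in which the belief is first propagated through the controlled transition matrix and then reweighted by the observation likelihood q(y | x, u) implied by (7.54). The state ordering (T, R, B), the labels "D" and "Dbar", and the function names are hypothetical.

```python
import numpy as np

# State order (T, R, B). "D" = drive, "Dbar" = don't drive (labels are our choice).
P = {
    "D": np.array([[0.8, 0.2, 0.0],
                   [0.9, 0.0, 0.1],
                   [0.0, 0.1, 0.9]]),
    "Dbar": np.array([[0.75, 0.25, 0.0],
                      [0.0, 0.0, 1.0],
                      [0.0, 0.0, 1.0]]),
}


def obs_likelihood(y, u_prev, theta):
    """Vector of q(y | x, u_prev) over x in (T, R, B), following (7.54)."""
    p_one_at_top = 1.0 if u_prev == "D" else theta
    q_one = np.array([p_one_at_top, 0.0, 0.0])   # Y = 1 is only possible at the top
    return q_one if y == 1 else 1.0 - q_one


def belief_update(b, y_next, u, theta=0.1):
    """One step of the filter: belief b at time k -> belief at time k+1."""
    predicted = np.asarray(b, dtype=float) @ P[u]           # prediction step
    unnormalized = predicted * obs_likelihood(y_next, u, theta)
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total                              # correction step


if __name__ == "__main__":
    b0 = np.ones(3) / 3.0
    print(belief_update(b0, y_next=1, u="D"))     # all mass on T
    print(belief_update(b0, y_next=0, u="Dbar"))  # T remains possible
```

Observing Y = 1 always collapses the belief onto T, while Y = 0 after driving rules T out entirely; only the combination "not driving, Y = 0" leaves genuine ambiguity about whether the rover is on top.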

Computation of $\phi^*$ is a major challenge in POMDP practice because the state space S is never
finite. We are left to look for approximation techniques, and reinforcement learning provides many
tools for this purpose.


Pre-publication draft -- March 25, 2022 




Suppose that the solution to the fully observed problem is defined by $U(k) = \bar{D}$ (don't drive) if and only if
X(k) = T. The policy shown in Fig. 7.5 is then well motivated in the partially observed setting. This
was obtained based on the following threshold formula, given non-negative constants $a_T, a_R, a_B, r$:

\[ \phi(b) = \begin{cases} \bar{D} & \text{if } a_T (b_T - 1)^2 + a_R b_R^2 + a_B b_B^2 \le r \\ D & \text{else} \end{cases} \qquad b \in S \]

The degenerate case using r = 0 is not excluded, giving $\phi(b) = \bar{D}$ only if $b_T = 1$.
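
The threshold rule is a one-line function of the belief state. A minimal sketch follows, assuming the belief is ordered as $(b_T, b_R, b_B)$ and that the rule selects "don't drive" when the weighted squared distance to the vertex T is at most r; the function name, the labels "D" and "Dbar", and the default constants are illustrative choices only.

```python
def threshold_policy(b, a=(1.0, 1.0, 1.0), r=0.05):
    """Return "Dbar" (don't drive) near the vertex b_T = 1, and "D" (drive) otherwise.

    b = (b_T, b_R, b_B) is the belief state; a = (a_T, a_R, a_B) and r are the
    non-negative constants in the threshold formula.
    """
    b_T, b_R, b_B = b
    a_T, a_R, a_B = a
    score = a_T * (b_T - 1.0) ** 2 + a_R * b_R ** 2 + a_B * b_B ** 2
    return "Dbar" if score <= r else "D"


if __name__ == "__main__":
    print(threshold_policy((1.0, 0.0, 0.0), r=0.0))      # certain to be on top
    print(threshold_policy((0.9, 0.05, 0.05)))           # fairly sure
    print(threshold_policy((0.5, 0.25, 0.25)))           # too uncertain: drive
```

With strictly positive weights and r = 0 the rover rests only when it is certain that it is on top, which is the degenerate case mentioned in the text.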
This example and the example in Section 7.6 expose several challenges and opportunities: 


(i) Information constraints on the input can render MDP theory inapplicable, but MDP algo- 
rithms may still lead to a useful policy. 


(ii) If the POMDP assumptions are valid, then the theory is very rich, but direct application of 
computational tools such as VIA is a significant challenge. Reinforcement learning is a great 
alternative. 


(iii) RL + POMDP is not model free, since the nonlinear filter that generates the belief state 
depends crucially on a model. In practice we either use a model, or replace the belief state 
with some other “features” to create a “pseudo state” that is believed to include enough 
information for reliable control. 


7.8 Bandits 


Our next partially observed control problem is known as a multi-armed bandit, which is a great 
vehicle to explain “exploitation” and “exploration” in reinforcement learning. As for the name 
bandit, consider a room full of slot machines, with different expected profit (or loss) on pulling an 
arm. This is called a K-armed bandit if there are a total of K distinct arms.

Other applications of varying degrees of usefulness are 


• Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move

• Restaurant Selection
Exploitation: Go to your favorite restaurant
Exploration: Try a new restaurant

• Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location

• Online Banner Advertisements
Exploitation: Show the most successful advertisement
Exploration: Show a different advertisement


How would you play one of these bandit games?

7.8.1 Bandit models


There are many models that capture the bandit theme. Here is one that may be regarded as a 
degenerate MDP model. 

For the K-armed bandit, we assume the existence of something like a state process, denoted Z,
that evolves on $\mathbb{R}^K$. In the simple model described here, it is assumed that this sequence is i.i.d.
with finite mean. The input evolves in the action space U = {1, ..., K}, and has no impact on Z.
The reward received at time k is

\[ R(k) = \sum_{u=1}^{K} 1\{U(k) = u\} Z_u(k) \]

That is, $R(k) = Z_u(k)$ if U(k) = u. An input sequence U is admissible if U(k) is a (possibly
randomized) function of the observed rewards {R(j) : j < k} for every time k. Our goal is to
construct U for which the average reward is maximized.
Let’s consider two extreme cases: 


(i) If the “state” Z(k) is observed at each time k, then obviously

\[ U^*(k) = \arg\max_{1 \le u \le K} Z_u(k) \]

This is optimal for any criterion: finite-horizon, discounted, or average-reward. However, we
only observe the rewards received {R(j) : j < k}, so this policy is not feasible.

(ii) The infinite-horizon optimal reward is defined by

\[ \eta^* = \max \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} R(k) \]

Denote $\bar{r}_u = E[Z_u(k)]$ (assumed independent of k), and $\bar{r}^* = \max_u \bar{r}_u$. An optimizing policy
is given by the open-loop strategy,

\[ U^*(k) = \arg\max_{1 \le u \le K} \bar{r}_u \]

This is not feasible unless the means are given.


We can of course estimate $\{\bar{r}_u : 1 \le u \le K\}$ using Monte-Carlo, which is an example of
exploration. However, for any action $u \in \{1, \dots, K\}$, an accurate estimate of $\bar{r}_u$ will likely require
frequent selection of this action. In the big-money applications of bandit theory (such as in
advertising), we are learning as we attempt to maximize profit. This means we must minimize the
time spent exploring with inevitable sub-optimal actions, which motivates a finer notion of optimal
reward:


Regret: For a given time horizon N, the regret is defined as the sum

\[ L_N = \sum_{k=1}^{N} [\bar{r}^* - R(k)] \qquad (7.55) \]

Under mild additional assumptions, its mean grows logarithmically in N for the best policies.

7.8.2 Bayesian bandits


If we can formulate the bandit problem within an MDP setting, theory in Appendix C predicts
that we can express an optimal policy as a form of state feedback:

\[ U^*(k) = \phi^*_{N-k}(\mathcal{X}(k)), \qquad 0 \le k \le N \qquad (7.56) \]

The belief state $\mathcal{X}(k)$ coincides with the conditional distribution of the state at time k, given
observations up to time k. This requires a Bayesian setting for which the reward is some randomized
function of the state. As a hopefully simple starting point, let’s turn to a linear-Gaussian model
for which the belief state is finite dimensional.

In the Gaussian bandit we create a state process evolving on $\mathbb{R}^K$ that is static: X(k) = X(0)
for each k. While static, the state can be modeled by the linear dynamics

\[ X(k+1) = X(k), \qquad k \ge 0 \qquad (7.57a) \]

The vector X = X(0) is assumed random, with Gaussian distribution. The process Z is also
assumed Gaussian:

\[ Z(k) = X(k) + W(k), \]

in which W is an i.i.d. Gaussian stochastic process with zero mean, and independent of X. The
reward received at time k is interpreted as an observation equation:

\[ Y(k) = R(k) = H_k X(k) + H_k W(k) \qquad (7.57b) \]

where $H_k$ is the K-dimensional row vector with entries $H_k(i) = 1\{U(k) = i\}$.

The pair of equations (7.57) looks like a special case of (7.45), in which the input U(k) is
hidden in the row vector $H_k$, and the observation noise has a slightly different form due to the
multiplication by $H_k$ in (7.57b). The conclusion of Prop. 7.5 stands: the conditional distribution of
X(k) given all observations up to time k is Gaussian, and the filtering equations simplify greatly.

For this model we can replace the belief state by the conditional mean and covariance that
define the conditional distribution. That is, we can take $\mathcal{X}(k) = \{\hat{X}(k), \Sigma_k\}$ in the state feedback
architecture (7.56). The dependency on both the conditional mean and covariance invites questions:


• Policy structure: How might $\phi^*$ depend upon the sufficient statistic $\{\hat{X}(k), \Sigma_k\}$ at time k? The
certainty equivalent policy, generalizing (7.51), is

\[ U(k) = \arg\max_u \hat{X}_u(k) \qquad (7.58) \]

This coincides with $\phi^*_{N-k}(\mathcal{X}(k))$ in a special case: $\Sigma_k = 0$, so that $\hat{X}_u(k) = X_u$ for each u.

Conversely, surely this is a terrible approach when the conditional covariance is far from zero,
so there is significant uncertainty in the reward anticipated from one or more arms. We need to
explore when there is high uncertainty.


• Exploration: How much is required? The following are anticipated for a good policy, in which
the time-horizon is not bounded:

(i) $\Sigma_k \to 0$ as $k \to \infty$ (uncertainty may lead to excessive sub-optimal pulls)

(ii) The rate of convergence of $\Sigma_k(i,i)$ to zero will be slow if $X_i$ is less than $X^* = \max_j X_j$.




The regret grows logarithmically in N if and only if $n_i(k) 1\{X_i < X^*\}$ grows logarithmically in
k for each i, where

\[ n_i(k) = \sum_{j=1}^{k} 1\{U(j) = i\} \qquad (7.59) \]

j=l 
We might then ask, is there a conflict between exploration and exploitation? If $n_i(k)$ is
constrained to grow logarithmically in k for “bad” i (exploitation), how can we be sure that $\Sigma_k(i,i)$
vanishes as $k \to \infty$ (exploration)? A careful look at the evolution of the conditional covariance
shows that there is no conflict. Since F = I and $\Sigma_N = 0$, the update equation for $\Sigma_k$ is both simple
and insightful:


Proposition 7.6. The covariance evolution for the Gaussian bandit is summarized as follows:
(i) $\Sigma_{k+1|k} = \Sigma_k$ for each $k \ge 0$
(ii) The inverse of the covariance has the representation

\[ \Sigma_k^{-1} = \Sigma_0^{-1} + D_k \qquad (7.60) \]

in which $D_k$ is a diagonal matrix with entries

\[ D_k(i,i) = \frac{1}{\sigma^2_{W_i}} n_i(k) \]

where $\sigma^2_{W_i}$ is the variance of the Gaussian random variable $W_i(k)$ (independent of k).
Consequently, $\Sigma_k \to 0$ as $k \to \infty$ if and only if each arm is pulled infinitely often.


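Proof. Part (i) follows from the first equation in (7.50), and from the second we obtain the update
equation

\[ \Sigma_k = \Sigma_{k-1} - \gamma^{-1} V V^T \quad \text{with} \quad V = \Sigma_{k-1} H_k^T, \quad \gamma = H_k \Sigma_{k-1} H_k^T + H_k \Sigma_W H_k^T \]

The Matrix Inversion Lemma gives

\[ \Sigma_k^{-1} = \Sigma_{k-1}^{-1} + \Sigma_{k-1}^{-1} V (\gamma - V^T \Sigma_{k-1}^{-1} V)^{-1} V^T \Sigma_{k-1}^{-1} \]

which simplifies dramatically when we insert the definition of V and $\gamma$:

\[ \Sigma_k^{-1} = \Sigma_{k-1}^{-1} + \frac{1}{H_k \Sigma_W H_k^T} H_k^T H_k \]

The second term is a matrix with all entries zero, except a single term on the diagonal. Specifically,
for all i, j,

\[ \Sigma_k^{-1}(i,j) = \Sigma_{k-1}^{-1}(i,j) + \frac{1}{\sigma^2_{W_i}} 1\{U(k) = i\} 1\{i = j\} \]

which establishes (ii). □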
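
The identity in part (ii) is easy to check numerically. The sketch below is an illustration under stated assumptions rather than anything from the text: the prior covariance, the noise variances, the uniformly random choice of arms, and the function name covariance_check are all arbitrary. It runs the rank-one covariance update from the proof and compares the result with the closed form $(\Sigma_0^{-1} + D_k)^{-1}$.

```python
import numpy as np


def covariance_check(num_steps=200, seed=0):
    """Run the recursive covariance update and return it with the closed form."""
    rng = np.random.default_rng(seed)
    Sigma0 = np.array([[2.0, 0.5, 0.0],
                       [0.5, 1.0, 0.2],
                       [0.0, 0.2, 1.5]])        # arbitrary positive definite prior
    Sigma_W = np.diag([0.5, 1.0, 2.0])          # arbitrary reward-noise variances
    K = Sigma0.shape[0]

    Sigma = Sigma0.copy()
    counts = np.zeros(K)                        # n_i(k): pulls of each arm
    for _ in range(num_steps):
        u = int(rng.integers(K))                # arm chosen at random (any rule works)
        H = np.zeros((1, K))
        H[0, u] = 1.0
        V = Sigma @ H.T
        gamma = (H @ Sigma @ H.T + H @ Sigma_W @ H.T).item()
        Sigma = Sigma - (V @ V.T) / gamma       # rank-one update from the proof
        counts[u] += 1

    D = np.diag(counts / np.diag(Sigma_W))      # D_k(i,i) = n_i(k) / sigma^2_{W_i}
    closed_form = np.linalg.inv(np.linalg.inv(Sigma0) + D)
    return Sigma, closed_form


if __name__ == "__main__":
    recursive, closed = covariance_check()
    print(np.max(np.abs(recursive - closed)))
```

The two matrices agree up to rounding error for any sequence of arm selections, since the covariance depends on the inputs only through the pull counts.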


The proposition implies that to compute the conditional covariance, it is sufficient to keep track
of the number of times that we pull each arm. Moreover,

\[ \lim_{k \to \infty} \Sigma_k^{-1}(i,i) = \Sigma_0^{-1}(i,i) + \frac{1}{\sigma^2_{W_i}} \sum_{k=1}^{\infty} 1\{U(k) = i\} \]

Hence a logarithmic upper bound on $n_i(k)$ does not prevent convergence of $\Sigma_k(i,i)$ to zero.

7.8.3 Naive optimism can be successful


Bandit theory is typically posed in a setting far more general than Gaussian bandits, and algorithm 
design is often based on a frequentist setting. That is, algorithms are based on empirical pmfs, and 
there is no use for a prior distribution on the rewards. 

A model is required for analysis. A typical choice is a parameterized model in which the
vector-valued process Z is i.i.d., with

\[ Z_u(k) \sim f(\,\cdot\,; \theta_u) \qquad (7.61) \]

in which $f(\,\cdot\,; \theta_u)$ is a density on $\mathbb{R}$ for each $u \in U$, with parameter $\theta_u \in \mathbb{R}^d$. Hence for each u,

\[ \bar{r}_u = E[Z_u(k)] = \int r f(r; \theta_u) \, dr \]

Once again, we are not interested in estimating $\theta_u$ if $\bar{r}_u < \bar{r}^*$, but we don’t know in advance which
arms are sub-optimal in this sense.


The frequentist will consider alternative representations of the regret (7.55). The following is
most valuable in obtaining sharp bounds:

\[ E[L_N] = N \sum_{u} E[\{\bar{r}^* - \bar{r}_u\} \tilde{\mu}_N(u)] \]

in which $\tilde{\mu}_N$ is the empirical pmf:

\[ \tilde{\mu}_N(u) = \frac{n_u(N)}{N} = \frac{1}{N} \sum_{k=1}^{N} 1\{U(k) = u\}, \qquad u \in U \qquad (7.62) \]

The goal of optimal control is to steer the empirical pmf so that $\{\bar{r}^* - \bar{r}_u\} \tilde{\mu}_N(u) \approx 0$ for each u.
To avoid taking expectations we might use the representation

\[ L_N = N \sum_{u} \{\bar{r}^* - \hat{r}_u(N)\} \tilde{\mu}_N(u) \qquad (7.63) \]

in which the “reward estimates” are defined by

\[ \hat{r}_u(k) = \frac{1}{n_u(k)} \sum_{j=1}^{k} 1\{U(j) = u\} Z_u(j) \]


The certainty equivalent policy $U(k+1) = \arg\max_u \hat{r}_u(k)$ is again doomed to failure. However, a
small tweak results in a very good policy.
The idea is to define recursively a sequence of positive scalars $\{\delta_u(k) : k \ge 1,\ 1 \le u \le K\}$,
denote

\[ \mathrm{UCB}_u(k) = \hat{r}_u(k) + \delta_u(k) \]

and ensure that the policy is designed so that these values serve as upper confidence bounds in the
sense that

\[ \lim_{k \to \infty} P\{\mathrm{UCB}_u(k) \ge \bar{r}_u\} = 1 \qquad (7.64) \]

One version of the UCB algorithm is the simple optimistic rule:

\[ U(k+1) \in \arg\max_i \mathrm{UCB}_i(k), \quad 0 \le k \le N-1, \qquad \text{using } \delta_u(k) = b_u \sqrt{\log(k) / [1 + n_u(k)]}, \quad 1 \le u \le K \qquad (7.65) \]

where $\{b_u\}$ are positive constants chosen by the user. Regardless of their values, the decision rule
(7.65) enforces $n_u(k) \to \infty$ as $k \to \infty$ for each u, and results in logarithmic regret under mild
assumptions [216].
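
A short simulation helps to see the optimistic rule in action. The following is a minimal sketch under assumptions that are ours and not the text's: rewards are Gaussian with a common standard deviation, the same constant b is used for every arm, the time index starts at k = 1 so that log(k) is defined, and an arm that has never been pulled is assigned the estimate zero. The function name run_ucb is hypothetical.

```python
import numpy as np


def run_ucb(means, horizon, b=1.0, noise_std=1.0, seed=0):
    """Simulate the optimistic rule with bonus b * sqrt(log(k) / (1 + n_u(k))).

    Returns the pull counts n_u(N) and the realized regret, computed as the
    sum over time of (best mean) - (reward received).
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    K = len(means)
    counts = np.zeros(K)      # n_u(k)
    sums = np.zeros(K)        # running sum of rewards collected from each arm
    best = means.max()
    regret = 0.0
    for k in range(1, horizon + 1):
        # Empirical mean reward for each arm (zero for arms not yet pulled)
        estimates = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        bonus = b * np.sqrt(np.log(k) / (1.0 + counts))
        u = int(np.argmax(estimates + bonus))     # optimistic choice
        reward = means[u] + noise_std * rng.standard_normal()
        counts[u] += 1
        sums[u] += reward
        regret += best - reward
    return counts, regret


if __name__ == "__main__":
    counts, regret = run_ucb([0.2, 1.0, 0.5, 0.0], horizon=2000, noise_std=0.1)
    print(counts, regret)
```

Plotting the regret against log(N) for several horizons, and for several values of b, is a simple way to explore the trade-off between exploration and exploitation in this rule.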


Figure 7.6: Four armed bandit under the UCB policy (7.65). The plot shows the estimates $\hat{r}_u(k)$ and the bounds $\mathrm{UCB}_u(k)$ for each of the four arms, with U(k+1) = 3 selected.


Fig. 7.6 shows an example of this policy with K = 4. In this 4-armed bandit we have $\bar{r}^* = \bar{r}_2$.
The value U(k+1) = 3 is selected through the rule (7.65), even though $\bar{r}_3 < \bar{r}_2$. In this example
it is likely that U(k+j) = 3 for several consecutive values of $j \ge 1$, but the value of $\mathrm{UCB}_3(k+j)$
is also likely to decrease each time this sub-optimal arm is selected.


Finer bounds. Lai and Robbins in 1985 considered the special setting (7.61), and introduced
a procedure to obtain a sharp logarithmic lower bound on the regret. It is based on the relative
entropy (or Kullback-Leibler divergence) between the densities. For each $\theta, \xi \in \mathbb{R}^d$ this is defined
by

\[ D(\theta \| \xi) = \int \log\Big( \frac{f(r; \theta)}{f(r; \xi)} \Big) f(r; \theta) \, dr \qquad (7.66) \]

It is obviously zero for $\xi = \theta$, and is also known to be everywhere non-negative. The bound obtained
in [208] for an optimal algorithm, minimizing the mean regret, is expressed as an approximation:

\[ E[L_N] \sim \Big( \sum_{u : \bar{r}_u < \bar{r}^*} \frac{\bar{r}^* - \bar{r}_u}{D(\theta_u \| \theta^*)} \Big) \log(N) \qquad (7.67) \]

where $\theta^*$ is the parameter of an optimal arm, and the symbol $\sim$ means that the ratio tends to one as $N \to \infty$.
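
For intuition about the size of the constant multiplying log(N), consider Gaussian arms. The sketch below assumes that the bound takes the classical Lai-Robbins form, a sum over sub-optimal arms of the gap divided by the relative entropy to the optimal arm, and that each density is $N(\theta_u, \sigma^2)$ with a common known variance, so that $D(\theta \| \xi) = (\theta - \xi)^2 / (2\sigma^2)$. The function names are hypothetical.

```python
def gaussian_kl(mean_a, mean_b, sigma=1.0):
    """Relative entropy D(N(mean_a, sigma^2) || N(mean_b, sigma^2))."""
    return (mean_a - mean_b) ** 2 / (2.0 * sigma ** 2)


def lai_robbins_constant(means, sigma=1.0):
    """Coefficient of log(N): sum over sub-optimal arms of gap / relative entropy."""
    best = max(means)
    return sum((best - m) / gaussian_kl(m, best, sigma) for m in means if m < best)


if __name__ == "__main__":
    # Gaps 0.8, 0.5 and 1.0 contribute 2/0.8 + 2/0.5 + 2/1.0 when sigma = 1
    print(lai_robbins_constant([0.2, 1.0, 0.5, 0.0]))
```

In the Gaussian case each sub-optimal arm contributes $2\sigma^2$ divided by its gap, so arms that are nearly optimal are the most expensive to rule out.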


7.9 Exercises 


7.1 Rover with full observations. 


(a) Sketch a graphical model of this MDP, as for an uncontrolled Markov chain (to make sense 
of the complex description of the rover). 


(b) Obtain the solution to the ACOE $(h^*, \eta^*, \phi^*)$ using VIA, or using common sense!

That is, it shouldn’t be hard to guess $\phi^*$, and then you simply solve Poisson’s equation for a 3-state
Markov chain.

7.2 This problem is intended as preparation for Exercise 7.3. Consider the state space model

\[ x(k+1) = (1 + \alpha) x(k) - u(k) \qquad (7.68) \]

where $\alpha > 0$. The state and input evolve on $\mathbb{R}_+$. Solve the total-cost Bellman equation with the cost
function c(x,u) = x + u. Suggestion: Guess a polynomial representation for $J^*$ and experiment.


7.3 Consider the MDP model in which the state evolves as follows,

\[ X(k+1) = X(k) + \sum_{i=1}^{X(k)} A_i(k+1) - U(k) + N(k+1) \]

The state and input evolve on $\mathbb{Z}_+ = \{0, 1, 2, \dots\}$. Assume the following:

(i) $\{A_i(k), N(k) : k \ge 1,\ i \ge 1\}$ are each i.i.d., with distributions supported on $\mathbb{Z}_+$.
(ii) The mean $\bar{n} = E[N(1)]$ is finite.
(iii) $0 < \alpha < 1$, where $\alpha = E[A_1(1)]$.
(iv) The cost function is c(x,u) = x + u.
(v) The input is subject to the hard constraints, $0 \le U(k) \le X(k)$.

Find $h^*$ and $\eta^*$ that solve the ACOE.


7.4 Basic MDP modeling. Each quarter, the marketing manager of a retail store divides customers
into two classes based on their purchase behavior in the previous quarter. Denote the classes as L
for low and H for high. The manager wishes to determine to which classes of customers she should
send quarterly catalogs.

The cost of sending a catalog is $15 per customer and the expected purchase depends on the
customer’s class and the manager’s action. If a customer is in class L and receives a catalog, then
the expected purchase in the current quarter is $20, and if a class L customer does not receive
a catalog her expected purchase is $10. If a customer is in class H and receives a catalog, then
her expected purchase is $50, and if a class H customer does not receive a catalog her expected
purchase is $25.

The decision whether or not to send a catalog to a customer also affects the customer’s classification
in the subsequent quarter. If a customer is class L at the start of the present quarter, then the
probability he is in class L at the subsequent quarter is 0.3 if he receives a catalog and 0.5 if he
does not. If a customer is class H in the current period, then the probability that he remains in
class H in the subsequent period is 0.8 if he receives a catalog and 0.4 if he does not.

Of course, the manager would like to maximize her average reward. 


(a) Formulate this as an infinite-horizon discounted Markov decision problem. Describe the con- 
trolled transition matrix and one-step “cost function” c(x,u) (the negative of the reward function). 
(b) Formulate an associated fluid model. Discrete time is best. This can be described as a 
one-dimensional model with state x, in which x(k) denotes the number of class H customers at 
time k (relaxing integer constraints, as we always do for the fluid model). 

(c) After reviewing Appendix B.2, describe the value iteration algorithm (VIA) for this MDP
model. Explain in detail how $V_{n+1}$ is obtained from $V_n$ for this specific model.

7.5 Consider the speed-scaling model with abandonment, in which the arrival process takes values
in {0,1}, and is conditionally independent:


* a < 100 


P{A(k+1) =1| X¥,U%; X(k) =2} = 
{A( ) =1| Xo, U9; X(k) = x} fi ae 400 


with 0 < 6 < 1 the abandonment rate. 
The dynamics remain the same, X(k +1) = X(k) —U(k) + A(k +1). 
(aa) Provide a clear proof that X is a Markov chain on a finite state space, when U is defined
by a stationary Markov policy satisfying $\phi(n) \ge 1$ if $n \ge 100$.
The remainder of the assignment is numerical, and requires algorithms from Appendix B.2. You 
will approximate the solution to the ACOE using 6 = 0.99, and c(z,u) = x + u?. Given an 
approximation h, you have a policy denoted @"(x) = arg min, {c(x, u) + Pub (x)}, and an estimate 
of n*. Stop the algorithm when the maximal normalized Bellman error €(h) is less than ¢ = 10~?, 
with 

o def 


. 1 
hy min( max max[1,c(x, pb” 


h 
capil) +r = Lele, @*(e)) + Pork (x)}|) 


In all three cases, plot $\phi^*(x)$ as a function of x (the answers may differ wildly!)


Comment on what you believe is the most efficient algorithm. You may want to explore the 
tolerance, and see how total time depends on tolerance: ¢ = 10~°, 107’, ... 


(a) VIA with initial value function Vo of your choosing. You might solve a fluid model optimal 
control problem for this model to obtain a useful initialization. 


(b) PIA with initial policy of your choosing (review Perron-Frobenius for this). 


(c) LP approach: maximize $\eta$ over all $(h, \eta)$, subject to the set of inequality constraints:

\[ c(x,u) - \eta + \sum_{y \in X} P_u(x,y) h(y) - h(x) \ge 0 \qquad \text{[more than } 10^3 \text{ constraints on } h\text{]} \]

for each $x \in X = \{0, 1, \dots, 100\}$ and $u \in U(x) = \{0, 1, \dots, x\}$


7.6 Continuing with Exercise 7.4, let’s now compute an optimal policy for varying customer 
populations. Let N denote the total number of customers, and solve the problem three times, 
for N = 107, 104, and 10° (reduce if necessary). 


(a) Let’s start with control of a fluid model in discrete time,

\[ x(k+1) = x(k) + F(x(k), u(k)) \]

where u(k) is the two dimensional vector that indicates the number of catalogs sent to high/low
customers, and x(k) is the number of high ranked customers.
An equilibrium is a triple $(x, u) = (x, u^H, u^L)$ such that F(x,u) = 0. Denote the cheapest
equilibrium by,

\[ (x^*, u^*) = \arg\min_{(x,u)} \{ c(x,u) : F(x,u) = 0 \} \]

You can easily compute this! The fluid value function might be defined as,

\[ J^*(x) = \min \sum_{k=0}^{T^*-1} [c(x(k), u(k)) - c(x^*, u^*)] \]

where $T^*$ is the first time that the pair $(x^*, u^*)$ is reached. You might find this hard to compute,
so use this:

\[ J^*(x) = \lim_{n \to \infty} [J_n(x) - J_n(x^*)] \]

where $J_n$ is the finite-horizon value function, and $x^*$ is defined above.
Compute $J^*$ and the optimal policy using value iteration.


(b) Compute the solution to the ACOE using value iteration. Try initializing with zero, and with 
J*, and compare your results. 


(c) Compute the solution to the ACOE using policy iteration (you must jump ahead to Section 9.1 
or the Appendix). Try an initial policy that is completely stupid (your choice), and also a policy 
based on the optimal policy for the fluid model. 


Provide plots of both policies and value functions in (a)—(c). Discuss your findings! 


As a criterion for convergence, choose ¢ = 1077, and stop the algorithm when the span seminorm 
of the Bellman error is no greater than $\varepsilon$:

\[ \varepsilon \ge \|\mathcal{E}\|_{sp} \stackrel{\rm def}{=} \max_x \mathcal{E}(x) - \min_x \mathcal{E}(x) \]

As a criterion for performance of the algorithm, compute the number of flops required in each 


experiment (fluid and stochastic). 
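
The span seminorm stopping rule is simple to code. A minimal sketch, assuming the Bellman error has already been computed and stored as an array indexed by the (finite) state space; the function names are hypothetical:

```python
import numpy as np


def span_seminorm(error):
    """Span seminorm of a function on a finite set: its maximum minus its minimum."""
    error = np.asarray(error, dtype=float)
    return float(error.max() - error.min())


def converged(error, eps):
    """True when the span seminorm of the Bellman error is no greater than eps."""
    return span_seminorm(error) <= eps


if __name__ == "__main__":
    print(span_seminorm([1.0, 3.5, 2.0]))
    print(converged([5.0, 5.0, 5.0], eps=1e-6))
```

Adding a constant to the error does not change its span, which is why this criterion is a natural fit for the ACOE, whose solution is only defined up to an additive constant.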


7.7 You are told that (X,U) is the state-input sequence for a controlled Markov chain with finite
state space and input space: X = {1,2,3,4,5} and U = {0,1}. The controlled transition matrix $P_u$
is not known. Explain how you would estimate it given a sequence of observations of the Markov
chain {X(k) : 0 < k < 10°}. As part of your answer you must explain how you would choose the 
input sequence {U(k):0<k < 10%}. 


7.8 Consider the average-cost optimal control problem with cost function c and relative value 
function h*, solving the ACOE, 


\[ \min_u \{ c(x,u) + P_u h^*(x) \} = \min_u \Big\{ c(x,u) + \sum_{y \in X} P_u(x,y) h^*(y) \Big\} = h^*(x) + \eta^* \]

Assume that the state space and action space are finite. Consider the value iteration algorithm
designed to approximate the Q-function (7.7):

\[ Q_{n+1}(x,u) = c(x,u) + P_u \underline{Q}_n(x), \qquad \text{where } \underline{Q}_n(y) = \min_u Q_n(y,u). \]

For each n, the function $Q_n$ is the value function for a finite-horizon optimal control problem,
similar to what we saw for ordinary value iteration (see (7.44)). Conjecture on the form of this 
optimization problem, and see if you can justify your claim. 


7.9 Consider the one dimensional inventory model defined by the recursion,

\[ X(k+1) = X(k) + S(k+1) U(k) - A(k+1), \qquad k \ge 0, \quad X(0) \in X = \mathbb{R}, \]

where (S, A) is i.i.d. as in the CRW queue. However, in this model, A(k) represents new demand
for some product at time k, and S(k) denotes a potential product completion at time k. A positive
value of X(k) corresponds to excess inventory, and a negative value to a deficit. This is a simplified
version of the model shown in Fig. 7.7 since we are modeling only the inventory buffer, and not the
buffer modeling storage of raw materials.

Figure 7.7: Single-station demand-driven model. The deficit buffer is non-empty means that X(k) < 0.

A piecewise linear cost function $c : \mathbb{R} \to \mathbb{R}_+$ is given, of the form $c(x) = c_- x_- + c_+ x_+$ with
$0 < c_- < c_+$, $x_+ = \max(x, 0)$, and $x_- = \max(-x, 0)$.


In this exercise you will restrict to a threshold policy of the following form: Given a constant
$\bar{x} > 0$, define U so that $X'(k) \stackrel{\rm def}{=} \bar{x} - X(k)$ is precisely the CRW queue under the non-idling policy.
Hence X(k) is restricted to the state space $X_{\bar{x}} = \{\bar{x}, \bar{x} - 1, \bar{x} - 2, \dots\}$, and U(k) = 1 if and only if
$X(k) \le \bar{x} - 1$.

For the same threshold, the fluid model is defined by

\[ \frac{d}{dt} x_t = \begin{cases} \mu - \alpha & \text{if } x_t < \bar{x} \\ 0 & \text{if } x_t = \bar{x} \end{cases} \]

for any initial condition $x_0 = x \le \bar{x}$, where $\mu$ and $\alpha$ are the mean values of S(k) and A(k)
respectively. That is, for $x_0 < \bar{x}$, $x_t$ increases linearly until it reaches the level $\bar{x}$, and thereafter
stays at this level.


(a) Compute the fluid value function,

\[ J_{\bar{x}}(x) = \int_0^\infty [c(x_t) - c(\bar{x})] \, dt, \qquad x \le \bar{x}. \]

Verify that $J_{\bar{x}}$ is continuously differentiable, and satisfies the dynamic programming equation,

\[ (\mu - \alpha) J'_{\bar{x}}(x) = -[c(x) - c(\bar{x})], \qquad x \le \bar{x}. \]

(b) Using Lemma 7.3, show that the function $b = [P - I] J_{\bar{x}} + c$ is bounded on $X_{\bar{x}}$. This
approximation implies that $J_{\bar{x}}$ “almost” solves Poisson’s equation for X.

(c) Estimate the optimal value of $\bar{x}$ that minimizes the average cost using simulation. Take
$c_+ = 10 c_-$, $\rho = \alpha/\mu = 0.9$, and choose the distribution of (A, S) so that the variance of A(k) − S(k)
is between 1 and 5.

Perform two sets of simulation experiments. In each case, you will conduct 10 experiments with 10
different values of $\bar{x}$. Let $(S^i, A^i)$ denote the sample paths used in the ith experiment, $1 \le i \le 10$.
Experiment 1: Ten independent runs are called, so that $(S^i, A^i)$ is independent of $(S^j, A^j)$ for
$i \ne j$.

Experiment 2: For this you must use a form of coupling: $(S^i, A^i) = (S^1, A^1)$ for each i. There is
a way to set the “seed” in the random number generator to simplify this experiment, so that the
code for Exp. 2 is similar to Exp. 1.

Discuss your findings. Do the plots of average cost vs. $\bar{x}$ look better in experiment 1 or 2? Explain
why this can be expected.

7.10 Consider the one dimensional inventory model of Exercise 7.9:

\[ X(k+1) = X(k) + S(k+1) U(k) - A(k+1), \qquad k \ge 0, \quad X(0) \in X = \mathbb{R}, \]

where (S, A) is i.i.d. as in the CRW queue.
(a) Suppose that $h^*$ is convex. Explain why the function,

\[ h^1(x) = E[h^*(x + A(k+1))] \]

is convex in x. Explain why this implies that there is a threshold policy, as defined in Exercise 7.9.
The following property of convex functions will be helpful: If $g : \mathbb{R} \to \mathbb{R}$ is convex, then for any
y > 0,

\[ g(x + y) - g(x) \text{ is non-decreasing in } x \]

Why? Because g is convex if and only if its derivative $\frac{d}{dx} g(x)$ is non-decreasing.

(b) Policy construction of Clark & Scarf. Let $\pi^0$ denote the steady-state distribution for X
obtained using $\bar{x} = 0$. Explain why it follows that for arbitrary $\bar{x}$, the resulting steady-state
distribution $\pi^{\bar{x}}$ is characterized as follows: for any function $g : \mathbb{R} \to \mathbb{R}$,

\[ E_{\pi^{\bar{x}}}[g(X)] = E_{\pi^0}[g(X + \bar{x})] \]

Letting g = c, obtain a characterization of $\bar{x}^*$ through differentiation.

(c) Use value iteration to approximate the optimal policy. Note that you will have to truncate
the state space, and return to an integer lattice. For example, take $X = \{\bar{x} + m : 0 \le m \le M\}$ for
some fixed M, with $\bar{x}$ an initial guess for a good threshold value.

Try two initializations: $V_0 = 0$, and $V_0$ inspired by a fluid value function. Compare the speed of
convergence of $V_{n+1}(0) - V_n(0)$ to $\eta^*$ in the two cases.


7.11 Consider a modification of the “well-motivated but unstable” policy used for the queueing 
network considered in Section 3.9.3. The problem with this policy is the focus was entirely on 
draining from the two “exit buffers”, without considering potential starvation of a station. In this 
exercise you will consider the following modification for the CRW model: given a pair of thresholds
$(\tau_1, \tau_2)$,

> The policy is assumed non-idling: $U_1(k) + U_4(k) = 1$ whenever $Q_1(k) + Q_4(k) \ge 1$, and
$U_2(k) + U_3(k) = 1$ whenever $Q_2(k) + Q_3(k) \ge 1$.
Subject to this constraint are priorities at each station:
> Priority is given to buffer 4 at Station 1 if Qo(k) + Q3(k) < 1 
> Priority is given to buffer 2 at Station 2 if Qi(k) + Qa(k) < t 
Obtain a plot of average cost as a function of $(\tau_1, \tau_2)$ via simulation. Perform multiple runs with
at least one pair so you can estimate the variance of your cost estimates.

After you learn about actor-critic methods you might try to optimize $(\tau_1, \tau_2)$ using one of these
algorithms.


7.12 Consider a discrete-time scalar process initialized with X(0) = 1 and evolving according to 


2X (k) with probability 2/3 


X(k+1)= 
( ) eee with probability 1/3 
At each time k, we can opt to stop the process and receive a payoff of G(X(k)), or to continue.
If the process is not stopped within 100 time steps, it terminates automatically and the payoff of
G(X(100)) is received. The goal is to maximize the expected discounted payoff. Solve the problem
using VIA, with discount factor $\gamma = 0.95$, for each of $G(x) = \max(0, 1 - x)$ and $G(x) = \min(0, 1 - x)$. In each
case, provide the maximal expected payoff.

7.13 Risk sensitive optimal control. This exercise is a followup to Exercise 6.22. 


(a) Consider the finite-horizon optimal control problem, with value function

\[ h_N(x) = \min E_x[Z], \qquad Z = \sum_{k=0}^{N-1} c(X(k), U(k)) + V_0(X(N)), \qquad X(0) = x \in X \qquad (7.69) \]

The corresponding risk-sensitive control problem is motivated by the need to penalize variance
while minimizing the mean of Z: for fixed r > 0, the risk-sensitive value function is defined by

\[ H_N(x) = \min \log(E_x[\exp(rZ)]), \qquad X(0) = x \in X \qquad (7.70) \]

Obtain a dynamic programming equation for (7.70): given $H_{N-1}$, obtain an update equation for $H_N$.
It will help to assume X = {1, ..., n} is finite, and your solution will depend on the matrix with
non-negative entries:

\[ R_u(i,j) = \exp(r c(i,u)) P_u(i,j) \]


(b) Postulate a dynamic programming equation for the infinite horizon problem with objective 


N 
" hs ieee. ol 
Ag) = min jim | NW log{ Ex [exm(r Lo) u(k)))| \ ; X(0)=r2EX 
See Exercise 6.22 for inspiration. 


(c) Solve the risk sensitive control problem numerically for the MDP in Exercise 7.1. 


7.14 Rover with partial information. Review Prop. C.3, and write down the formula for the
nonlinear filter for the POMDP described in Section 7.7:

\[ \mathcal{X}(k+1) = M(\mathcal{X}(k), Y(k+1), U(k)), \qquad k \ge 0 \]

Take $\{\Gamma(k)\}$ in (7.54) to be i.i.d. Bernoulli with parameter $\theta = 0.1$.

The fact that observations depend on the input doesn’t cause any difficulties: you will need to
define q(y | x, u).

(a) The MAP estimator of X(k) given observations up to time k is given by $\hat{X}^{MAP}(k) =
\arg\max_x \mathcal{X}_x(k)$. Plot the evolution of the true state and its sequence of MAP estimates using
two different policies: (i) degenerate, and (ii) a policy of your choosing in which $\phi(b) = \bar{D}$ (don't drive) for b
in a neighborhood of T. Plot the estimate $\hat{X}^{MAP}(k)$ and X(k) on the same plot in each case, and
estimate the MSE (mean-square error) over the run. Please use the same random seed for cases
(i) and (ii). To estimate the MSE you may want to run multiple independent trials with different
seeds.

(b) Fix values of $\{a_T, a_R, a_B\}$ of your choosing, and plot the average cost as a function of r along
with $2\sigma$ confidence bounds. It is even more important to use multiple runs in this case, since you
need to estimate the variance $\sigma^2$ of your estimate.

This is another great example for testing Q-learning and actor-critic algorithms developed in later
chapters.


7.15 Parameter estimation. In Section 7.8.2 we saw that estimation can be cast as a state
estimation problem. The purpose of this exercise is to look at a simpler estimation problem, with the
“exploration” issue set aside.

We are given scalar measurements

\[ Y(k) = \theta + W(k), \]

and wish to obtain an estimate $\hat{\theta}(n)$ of $\theta$, based on n observations. If you are told that W is a zero
mean sequence, a natural choice is Monte-Carlo,

\[ \hat{\theta}_{MC}(n) = \frac{1}{n} \sum_{k=0}^{n-1} Y(k) \]

Suppose instead that it is known that W is the output of a stable filter: for some 1 × n matrix G,
and n × n matrix F we have for $k \ge 0$,

\[ Z(k+1) = F Z(k) + N(k+1), \qquad W(k) = G Z(k) \]

It is assumed that N is Gaussian white noise with marginal $N(0, \Sigma_N)$, and that $\theta$ has a known
Gaussian distribution that is independent of N.

(a) Write down state equations for this system with state $X(k) = (Z(k), \theta)^T$, and write down the
Kalman filter equations to estimate X and $\theta$.

(b) Consider the scalar case n = 1. Do the equations simplify?

(c) When F = 0 the solution simplifies dramatically (follow the derivation of $\Sigma_k$ for the Gaussian
bandits). Compare the optimal estimator to $\hat{\theta}_{MC}$ in this special case.


7.16 This exercise concerns the Gaussian bandit introduced in Section 7.8.2 with only two arms
($K = 2$). If the variance of the rewards is equal across arms then we obtain a slightly simpler
model:

$$R(k) = U(k)X_1 + (1 - U(k))X_2 + W(k), \qquad k \ge 0$$

where $\{X_i\}$ are independent Gaussian random variables, the “arm” is defined by $U(k) \in \mathsf{U} = \{0,1\}$,
and $W$ is i.i.d., $N(0, \sigma_W^2)$ and independent of $X$. Recall this may be regarded as a POMDP with
observation process $Y = R$.

For any choice of input, and each $i$, the conditional distribution of $X_i$ given the observations up to
time $k$ is Gaussian: for all $S \subset \mathbb{R}$,

$$P\{X_i \in S \mid Y_0^k, U_0^k\} = \int_S p_i(x; k)\, dx$$

where

$$p_i(x; k) = \frac{1}{\sqrt{2\pi\sigma_i^2(k)}} \exp\Bigl\{-\frac{1}{2\sigma_i^2(k)}\bigl(x - m_i(k)\bigr)^2\Bigr\}$$

(a) Obtain expressions for $\{\sigma_i^2(k), m_i(k) : i = 1, 2,\ k \ge 0\}$.
In the remaining numerical experiments take Ce = 1, and (Xy.io)! == (°). 


(b) Try out the certainty equivalent policy (7.58) over a fixed time horizon. Perform multiple 
independent runs, and present a histogram of the regret. 


(c) Repeat (b), but with (7.58) replaced by a policy of your choice (such as the UCB rule (7.65)). 


7.10 Notes 


See [46, 45, 162, 291] and the collection [129] for much more on the foundations of MDP theory. In 
particular, [46, 45] also emphasizes the close parallels between deterministic and stochastic control 
theory (as does the earlier classic [43]). Other must-reads include [106, 123, 128]. 

See [81] for an encyclopedic treatment of stochastic linear systems, including many approaches 
to control, state estimation, and parameter estimation. Textbooks covering basic material include 
[205, 7, 77]. Section 7.2 and analysis of the speed scaling model is adapted from the book chapter 
[165], based on a longer history e.g., [141, 92, 8]. The article [92] was the outcome of a class project 
for stochastic control at the University of Illinois during the 2008 fall semester. 

The “queueing game” introduced in Section 7.6 is based on [255], inspired by conversations 
regarding the “Mori-Zwanzig formalism” for model reduction. In this example, the 4 dimensional 
control problem is replaced with one of dimension 2. The soft state aggregation approach to function 
approximation [324] is also based on the construction of a simple Markov model, defined via Bayes’ 
rule as in eq. (7.52). The origin of this idea can be found in remarks by Claude Shannon in his 
1948 paper [316], regarded as the birth of modern information theory. 

The concept of the belief state for optimal control of Markov chains with partial observations 
is credited to the 1960 paper of Stratonovich [334], followed shortly after by Åström [13]. Two 
valuable resources are available online: van Handel in [362] treats theory of nonlinear filtering 
(the recursion that generates the belief state), and Krishnamurthy [201] contains a nice survey 
and history of POMDPs, highlighting structural results for value functions that inspire function 
approximation architectures for RL (some conclusions most relevant to this book are discussed at 
the end of Appendix C.3). 

See [336, 337, 178] for principled approaches to construct an approximate belief state in appli- 
cations to RL, along with substantial history on the topic. 

There was no time or space to include more examples and exercises related to games. Exer- 
cise 3.9 would provide a great example for RL algorithms to come, and for a large collection of 
interacting agents it is possible to predict the optimal solution through the theory of mean-field 
games [167, 166, 214]. Examples illustrating the application of mean-field theory to RL can be 
found in [246, 380, 382, 381]. 

The bandits literature has taken off in the past two decades, so just a few historical notes are 
provided here. See [79, 216] for comprehensive and recent treatments of bandit theory. 

First, remember that UCB refers to upper confidence bound, a concept introduced by Lai and 
Robbins in their derivation of (7.67). The version (7.65) was introduced in [17], which contains an 
elegant proof of logarithmic regret in a strong sense, following the asymptotic analysis of a similar 
flavor in [3]. 

The reference to the empirical distributions (7.62) is intended to recall their use in information 
theory, where they together with relative entropy play a role in explaining channel capacity and 
error exponents [105], as well as the sharp bounds of Lai and Robbins. See [293, 295] for the 
use of similar information-theoretic techniques to obtain lower bounds on performance in system 
identification and optimization. 
Chapter 8 


Stochastic Approximation 


Quasi stochastic approximation is the engine behind the gradient-free optimization algorithms 
surveyed in Section 4.6 and Q-learning algorithms introduced in Chapter 5. The history of stochastic 
approximation (SA) is far older, so that SA techniques are far more familiar to algorithm designers. 

The goal is identical to the starting point of Section 4.5: we wish to solve the root finding
problem $\bar f(\theta^*) = 0$, where $\bar f\colon \mathbb{R}^d \to \mathbb{R}^d$ is defined as an expectation:

$$\bar f(\theta) := E[f(\theta, \Phi)], \qquad \theta \in \mathbb{R}^d \tag{8.1}$$

As in the deterministic setting, $f\colon \mathbb{R}^d \times \Omega \to \mathbb{R}^d$, and $\Phi \in \Omega$ is a random vector.
The SA recursion is entirely analogous to the QSA ODE (4.44):


Stochastic Approximation 


For initialization $\theta_0 \in \mathbb{R}^d$, obtain the sequence of estimates recursively:

$$\theta_{n+1} = \theta_n + \alpha_{n+1} f_{n+1}(\theta_n) \tag{8.2}$$

where $f_{n+1}(\theta_n) = f(\theta_n, \Phi(n+1))$, $\Phi(n)$ has the same distribution as $\Phi$ for each $n$ (or its distribution
converges to that of $\Phi$ as $n \to \infty$), and $\{\alpha_n\}$ is a non-negative scalar step-size sequence.
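As a minimal sketch of the recursion (8.2), consider a hypothetical scalar example that is not taken from the text: $f(\theta, \Phi) = \cos(\theta) - \theta + \Phi$ with $\Phi$ standard Gaussian, so that the mean vector field is $\bar f(\theta) = \cos(\theta) - \theta$, with root $\theta^* \approx 0.7391$. The function names, the random seed, and the step-size $\alpha_n = 1/n$ are illustrative assumptions.

```python
import math
import random


def f(theta, phi):
    """Noisy observation f(theta, Phi); its mean is cos(theta) - theta (assumed example)."""
    return math.cos(theta) - theta + phi


def sa_estimate(n_iter, theta0=0.0, seed=0):
    """Run theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, Phi(n+1)) with alpha_n = 1/n."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(n_iter):
        alpha = 1.0 / (n + 1)        # alpha_{n+1}
        phi = rng.gauss(0.0, 1.0)    # Phi(n+1), i.i.d. N(0, 1)
        theta += alpha * f(theta, phi)
    return theta


theta_star = 0.7390851332151607      # root of cos(theta) = theta
theta_hat = sa_estimate(20_000)
print(theta_hat, abs(theta_hat - theta_star))
```

The mean vector field in this sketch is Lipschitz continuous and strictly decreasing, so the associated ODE has a single globally attracting equilibrium; other examples may require more care with the gain.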


Analysis begins with a representation of (8.2) as a “noisy” Euler approximation,

$$\theta_{n+1} = \theta_n + \alpha_{n+1}\bigl[\bar f(\theta_n) + \Delta_{n+1}\bigr], \qquad n \ge 0 \tag{8.3a}$$

in which

$$\Delta_{n+1} = f_{n+1}(\theta_n) - \bar f(\theta_n) \quad \text{and} \quad \Delta^\infty_{n+1} = f_{n+1}(\theta^*) \tag{8.3b}$$

Stability of the ODE and additional minor assumptions imply consistency:

$$\lim_{n\to\infty} \theta_n = \theta^* \quad \text{with probability one}$$

See Thm. 8.1 for a proof that follows theory in Section 4.5.
SA is one component of the ODE method applied in the remainder of the book: 


ODE Method.

1. Formulate the algorithmic goal as a root finding problem $\bar f(\theta^*) = 0$.
2. Refine the design of $f$, if necessary, to ensure that the associated ODE is globally
asymptotically stable:

$$\tfrac{d}{dt}\vartheta = \bar f(\vartheta) \tag{8.4a}$$

The Newton-Raphson Flow introduced in Section 4.3 is one example of this step.
3. Is an Euler approximation appropriate?

$$\theta_{n+1} = \theta_n + \alpha_{n+1}\bar f(\theta_n), \qquad n \ge 0 \tag{8.4b}$$

In particular, is $\bar f$ Lipschitz continuous?
4. Design an SA algorithm to approximate (8.4b).


Asymptotic covariance The rate of convergence of the error sequence $\tilde\theta_n = \theta_n - \theta^*$ can be
measured in terms of the error covariance

$$\Sigma_n = E[\tilde\theta_n \tilde\theta_n^T] \tag{8.5}$$

from which we obtain the mean-square error (MSE) $\sigma_n^2 := E[\|\tilde\theta_n\|^2] = \mathrm{trace}(\Sigma_n)$. When using the
step-size $\alpha_n = g/(n + n_0)^\rho$ with $n_0 \ge 0$, $g > 0$, and $\rho \in (0,1]$, the asymptotic covariance is defined
to be the limit:

$$\Sigma_\theta = \lim_{n\to\infty} n^\rho \Sigma_n \tag{8.6}$$

The limit exists and is finite under mild assumptions. Moreover, it admits an expression in terms
of simple algorithm primitives: the asymptotic covariance of $\{\Delta_n^\infty\}$ appearing in (8.3b), and the
linearization of $\bar f$ at $\theta^*$. See Section 8.1.5 for more details.

Similar to the definition (4.68), we say that $\sigma_n^2$ tends to zero at rate $1/n^\mu$ (with $\mu > 0$) if for
each $\varepsilon > 0$,

$$\lim_{n\to\infty} n^{\mu-\varepsilon}\sigma_n^2 = 0 \quad \text{and} \quad \lim_{n\to\infty} n^{\mu+\varepsilon}\sigma_n^2 = \infty \tag{8.7}$$

Hence when (8.6) holds with non-zero limit $\Sigma_\theta$, we obtain (8.7) with $\mu = \rho$. For the applications
considered in the remainder of this book, the optimal convergence rate is obtained using $\rho = 1$:

$$\sigma_n^2 = E[\|\theta_n - \theta^*\|^2] = O(1/n) \tag{8.8}$$

However, in Section 8.1.5 we see that this fast convergence is only possible with $g > 0$ sufficiently
large when using $\alpha_n = g/(n + n_0)$.

The theory of convergence rates for QSA surveyed in Section 4.5 was based on consideration
of a scaled process in continuous time, $Z_t = a_t^{-1}\tilde\Theta_t$. The continuous time setting was useful for
approximating its dynamics by a linear ODE. A similar approach is used for stochastic approximation,
but with a few significant differences. First, the convergence rate is typically much slower: to
be precise, we are forced to introduce a square-root in the scaling:

$$Z_n = \frac{1}{\sqrt{\alpha_n}}\,\tilde\theta_n \tag{8.9}$$

We also lose some of the simplicity of ODE analysis.
Along with these mean-square error approximations is a Central Limit Theorem, for which $\Sigma_\theta$
is the asymptotic covariance (under general assumptions):

$$\lim_{n\to\infty} Z_n \overset{d}{=} W, \qquad W \sim N(0, \Sigma_\theta)$$


where the convergence is in distribution. Numerical results contained in Section 8.3 and in many 
examples later in the book show that the CLT is often a good predictor of algorithm performance 
in applications to optimization and RL. 

It is possible to optimize the asymptotic covariance over all algorithms in a prescribed class:
see Section 8.1.5, where a formula for the optimal covariance $\Sigma_\theta^*$ may be found in (8.30). Three
“ancient” approaches are, in chronological order:


> Stochastic Newton Raphson (SNR) of Chung [95]. 

> Stochastic quasi Newton-Raphson of Ruppert [305]. 

> The averaging technique of Ruppert [306], and Polyak and Juditsky¹⁷ [287, 288]: see Thm. 8.13.
The averaging technique is defined exactly as for QSA in (4.78): 


Polyak-Juditsky-Ruppert Averaging 


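For initialization $\theta_0 \in \mathbb{R}^d$, obtain the sequence of estimates $\{\theta_n\}$, and a final estimate $\theta_N^{PR}$ as follows:

$$\theta_{n+1} = \theta_n + \beta_{n+1} f_{n+1}(\theta_n), \qquad 0 \le n \le N-1 \tag{8.10a}$$

$$\theta_N^{PR} = \frac{1}{N - N_0}\sum_{k=N_0+1}^{N} \theta_k \tag{8.10b}$$

where $1 \le N_0 < N$. The step-size sequence $\{\beta_n\}$ is square-summable, and satisfies

$$\lim_{n\to\infty} n\beta_n = \infty \tag{8.10c}$$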
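The averaging scheme (8.10) can be sketched as follows, reusing a hypothetical scalar model that is not from the text: noisy observations of $\bar f(\theta) = \cos(\theta) - \theta$ with additive standard Gaussian noise. The step-size $\beta_n = 1/n^{0.7}$ is one assumed choice that is square-summable and satisfies (8.10c); the averaging window and seed are also illustrative.

```python
import math
import random


def pjr_average(n_total, n_start, rho=0.7, theta0=0.0, seed=1):
    """Run (8.10a) with beta_n = 1/n^rho, then average theta_k over k = n_start+1..n_total."""
    rng = random.Random(seed)
    theta = theta0
    running_sum = 0.0
    for n in range(n_total):
        beta = 1.0 / (n + 1) ** rho                      # beta_{n+1}
        noisy_f = math.cos(theta) - theta + rng.gauss(0.0, 1.0)
        theta += beta * noisy_f                          # theta_{n+1}
        if n + 1 > n_start:                              # keep indices N0+1, ..., N
            running_sum += theta
    return theta, running_sum / (n_total - n_start)


last, averaged = pjr_average(n_total=20_000, n_start=5_000)
print("final iterate:", last)
print("PJR average:  ", averaged)
```

In runs of this kind the averaged estimate is typically closer to the root than the final iterate, though a single run with one seed is only suggestive.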


Two recent techniques to optimize the asymptotic covariance are 
> Zap-SA: intended to approximate the Newton-Raphson flow (4.14a). 
> Matrix momentum algorithms [108]. 


Zap SA algorithms are defined in Section 8.1; they stand out because they are stable under minimal 
assumptions on the model. Momentum methods are unfortunately beyond the scope of this book, 
but the first order version of Zap SA described in Section 8.4.2 is similar to the momentum algorithm 
NeSA of [108]. 


17 As in Part 1 of the book, the “J” is omitted from the super-script in (8.10b): this is to keep notation compact, 
and also because of Polyak’s independent work before Juditsky. 
8.1 Themes and Roadmaps 


This section is intended to provide an overview of SA theory, and how theoretical insights are used 
to create algorithms. It is a complex section that deserves its own roadmap: 


(i) Section 8.1.1 simply recalls a basic message: it is often best to consider an ODE as a starting 
point in algorithm design. Both stability and rates of convergence of your algorithm will rest 
on properties of the ODE. 


(ii) It is essential to understand what we mean by ODE approximation of a recursive algorithm. 
Section 8.1.2 provides an explanation that parallels the QSA theory of Section 4.9. 


(iii) The choice of step-size is not at all obvious. A few minimal requirements are explained in 
Section 8.1.3, and discussion surrounding two time-scale SA algorithms (distinguished by two 
separate step-size sequences) is contained in Section 8.1.4. 


(iv) Next, we need a way to distinguish between two SA designs, and determine which is better. 
We first review in Section 8.1.5 the standard performance metric in machine learning and RL 
based on sample complexity bounds. Mean-square error is preferred in this book for many 
reasons surveyed in this subsection and later in the chapter. For example, the asymptotic 
covariance $\Sigma_\theta$ appearing in (8.6) solves a “Lyapunov equation”, which leads to tools for 
algorithm design. 


(v) By definition of “asymptotic”, transient behavior is ignored in asymptotic analysis. The 
potential tension between asymptotic and transient performance is the topic of Section 8.1.6, 
which also shows how PJR averaging is a means to break this tension. 


8.1.1 ODE Design 


Consider first an ideal setting in which the distribution of $\Phi$ is known, and it is not costly to
evaluate $\bar f(\theta)$ for $\theta \in \mathbb{R}^d$. In this case estimates of $\theta^*$ might be obtained using the ODE

$$\tfrac{d}{dt}\vartheta_t = \bar f(\vartheta_t)$$

or an Euler approximation (an instance of successive approximation):

$$\theta_{n+1} = \theta_n + \alpha_{n+1}\bar f(\theta_n) \tag{8.11}$$

Convergence of the ODE or (8.11) is possible through careful design of the function $f$ that determines $\bar f$.

The Newton-Raphson flow introduced in Section 4.3 is an ODE that is convergent under minimal
conditions on $\bar f$ (see Prop. 4.4). Its Euler approximation is

$$\theta_{n+1} = \theta_n - \alpha_{n+1}[A(\theta_n)]^{-1}\bar f(\theta_n), \qquad A(\theta_n) = \partial_\theta \bar f(\theta)\big|_{\theta = \theta_n} \tag{8.12}$$

where $\partial_\theta \bar f$ denotes the Jacobian as in (4.13). Global convergence of the Newton-Raphson flow holds
under very mild conditions. The same is true for (8.12) if the step-size is chosen with care, informed
by a rich literature on ODEs. It may turn out that a more efficient and reliable approximation is
obtained using a more sophisticated Runge-Kutta method.

When the step-size in (8.12) is set to unity, this becomes the Newton-Raphson algorithm:

$$\theta_{n+1} = \theta_n - [A(\theta_n)]^{-1}\bar f(\theta_n) \tag{8.13}$$
Under mild conditions, the estimates converge extremely rapidly to $\theta^*$:

$$\lim_{n\to\infty} \frac{1}{n^2}\log(\|\theta_n - \theta^*\|) < 0$$

That is, $\|\theta_n - \theta^*\| \le B\exp(-\varepsilon n^2)$ for some $B < \infty$ and $\varepsilon > 0$. However, this convergence is only
local: valid for $\theta_0$ in a neighborhood of $\theta^*$.
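The rapid local convergence of the Newton-Raphson algorithm (8.13) is easy to observe numerically. The sketch below uses a hypothetical scalar vector field $\bar f(\theta) = \cos(\theta) - \theta$ (an assumption for illustration, not an example from the text), for which $A(\theta) = -\sin(\theta) - 1$.

```python
import math


def f_bar(theta):
    """Assumed scalar vector field with root near 0.7391."""
    return math.cos(theta) - theta


def jacobian(theta):
    """A(theta): derivative of f_bar."""
    return -math.sin(theta) - 1.0


def newton_raphson(theta0, n_iter=10):
    """Iterate theta_{n+1} = theta_n - A(theta_n)^{-1} f_bar(theta_n), as in (8.13)."""
    theta = theta0
    history = [theta]
    for _ in range(n_iter):
        theta = theta - f_bar(theta) / jacobian(theta)
        history.append(theta)
    return history


history = newton_raphson(theta0=0.0)
for n, theta in enumerate(history[:6]):
    print(n, abs(theta - 0.7390851332151607))
```

The printed errors shrink very quickly once the iterate is near the root; this says nothing about initial conditions far from $\theta^*$, where (8.13) may fail.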


8.1.2. ODE approximation 


Comparison of (8.3a) with an ODE requires a time transformation similar to the introduction of $\tau_t$
in (4.73) for analysis of QSA. Our first introduction to a standard Euler approximation was (4.8),
in which the time-points $\{t_k\}$ were assumed given. Here the step-size sequence $\{\alpha_n\}$ is given, and
we define the time points via $\tau_0 = 0$, and $\tau_{k+1} = \tau_k + \alpha_{k+1}$ for $k \ge 0$.

The ODE approximation is based on a comparison of two processes in continuous time:

(i) $\Theta_t = \theta_k$ when $t = \tau_k$, for each $k \ge 0$, and defined for all $t$ through piecewise linear interpolation.
The notation is used to emphasize the similarity with the QSA parameter estimates
considered in Section 4.5.

(ii) For each $n \ge 0$, let $\{\vartheta_t^{(n)} : t \ge \tau_n\}$ denote the solution to (8.4a), initialized according to the
current parameter estimate:

$$\tfrac{d}{dt}\vartheta_t^{(n)} = \bar f(\vartheta_t^{(n)}), \quad t \ge \tau_n, \qquad \vartheta_{\tau_n}^{(n)} = \theta_n \tag{8.14}$$

Expressed as a differential equation, it seems difficult to compare $\vartheta_{\tau_k}^{(n)}$ and $\theta_k$ for $k > n$. This is
why the ODE approximation is obtained in integral form.


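The cumulative disturbance is defined for any $K > n$ by

$$M_K^{(n)} = \sum_{i=n+1}^{K} \alpha_i \Delta_i \tag{8.15}$$

with $\Delta_i$ defined in (8.3b). Iteration of (8.3a) then gives,

$$\begin{aligned}
\Theta_{\tau_K} &= \theta_n + \sum_{i=n+1}^{K} \alpha_i \bar f(\Theta_{\tau_{i-1}}) + M_K^{(n)} \\
&= \theta_n + \int_{\tau_n}^{\tau_K} \bar f(\Theta_\tau)\, d\tau + \mathcal{E}_K^{(n)}
\end{aligned} \tag{8.16}$$

where $\mathcal{E}_K^{(n)}$ is the sum of $M_K^{(n)}$ and the error resulting from the Riemann-Stieltjes approximation of
the integral. This disturbance term will vanish with $n$, uniformly in $K$, subject to conditions on
$\{\Delta_i\}$ and the step-size.

The integral representation of solutions to the ODE is identical, with the disturbance removed:

$$\vartheta_t^{(n)} = \theta_n + \int_{\tau_n}^{t} \bar f(\vartheta_\tau^{(n)})\, d\tau, \qquad t \ge \tau_n \tag{8.17}$$

Theory in Section 4.9 can then be applied to establish the following: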


Pre-publication draft -- March 25, 2022 




Figure 8.1: ODE approximation on time intervals $[\mathcal{N}_n, \mathcal{N}_{n+1})$ of width approximately $T$.


Theorem 8.1. Suppose that the following hold:
> $\bar f$ is Lipschitz continuous.
> The parameter sequence $\{\theta_n : n \ge 0\}$ is bounded a.s.
> The disturbance vanishes in the following uniform sense: for each $T > 0$,

$$\lim_{n\to\infty} \sup_K \|M_K^{(n)}\| = 0 \quad \text{a.s.} \tag{8.18}$$

where the supremum is over $K$ satisfying $K > n$ and $\tau_K - \tau_n \le T$.
Then,

(i) For each $T$,

$$\lim_{n\to\infty}\ \sup_{\tau_n \le t \le \tau_n + T} \|\vartheta_t^{(n)} - \Theta_t\| = 0 \quad \text{a.s.} \tag{8.19}$$

(ii) If the ODE (8.4a) is globally asymptotically stable, with unique equilibrium $\theta^*$, then

$$\lim_{t\to\infty} \Theta_t = \lim_{n\to\infty} \theta_n = \theta^* \quad \text{a.s.}$$


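Proof. Part (ii) of the theorem is illustrated in Fig. 8.1 (see also Fig. 8.5 and surrounding discussion).
Global asymptotic stability combined with boundedness of the parameter sequence implies the
following: for any given $\delta > 0$, there is $T < \infty$ and $\varepsilon < \delta$ such that

(i) If $\|\vartheta_{\tau_n}^{(n)} - \theta^*\| < \varepsilon$ then $\|\vartheta_t^{(n)} - \theta^*\| < \delta$ for $t \ge \tau_n$, and
(ii) $\|\vartheta_t^{(n)} - \theta^*\| < \varepsilon/2$ for each $n$ and all $t \ge \tau_n + T$.

The sampling times indicated in the figure are defined by $\mathcal{N}_0 = 0$ and for $n \ge 1$,

$$\mathcal{N}_n = \min\{\tau_k : \tau_k \ge \mathcal{N}_{n-1} + T\}$$

Each of these times can be expressed $\mathcal{N}_n = \tau_{n_n}$ for some integer $n_n \ge n$. They are constructed so
that the ODE solution $\vartheta^{(n_n)}$ on $[\mathcal{N}_n, \infty)$ will satisfy $\|\vartheta_t^{(n_n)} - \theta^*\| < \varepsilon/2$ for $t \ge \mathcal{N}_{n+1}$.
Applying (i) we conclude that for some integer $n(\delta)$ and all $n \ge n(\delta)$,

$$\|\Theta_{\mathcal{N}_{n+1}} - \theta^*\| \le \|\Theta_{\mathcal{N}_{n+1}} - \vartheta_{\mathcal{N}_{n+1}}^{(n_n)}\| + \|\vartheta_{\mathcal{N}_{n+1}}^{(n_n)} - \theta^*\| < \varepsilon$$

By definition, $\vartheta_{\mathcal{N}_{n+1}}^{(n_{n+1})} = \Theta_{\mathcal{N}_{n+1}}$ is the initialization for the next interval, which implies that

$$\|\vartheta_t^{(n_k)} - \theta^*\| < \delta \quad \text{for all } k \ge n(\delta) + 1,\ t \ge \mathcal{N}_k$$

From (8.19) we then obtain

$$\limsup_{n\to\infty} \|\theta_n - \theta^*\| = \limsup_{t\to\infty} \|\Theta_t - \theta^*\| \le \delta$$

This establishes convergence, since $\delta > 0$ is arbitrary. □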


Projection and re-start The introduction of $n_0 > 0$ in the step-size sequence $\alpha_n = g/(n + n_0)^\rho$ is
to avoid excessive gain when $n$ is small. Anyone who has experimented with SA has seen parameter
estimates explode in the first few iterations. Without understanding of the asymptotic covariance,
it is likely that a user will reduce the value of $g$, not realizing this might result in infinite asymptotic
covariance. Increasing $n_0$ can tame the algorithm, and has no impact on the asymptotic covariance.

Two alternatives are available, each based on a closed region $R \subset \mathbb{R}^d$ satisfying $\theta^* \in R$ (in
practice, the choice of this set is much like the choice of $n_0$: through trial and error).

Projection. The definition requires some notion of distance: $\mathrm{dist}(w, v) \ge 0$ for any $w, v \in \mathbb{R}^d$ with
$\mathrm{dist}(v, v) = 0$ for any $v$. It is typical to use either the standard Euclidean norm or the max-norm:
$\mathrm{dist}(w, v) = \|w - v\|_\infty = \max_i |w_i - v_i|$. For any vector $v \in \mathbb{R}^d$, the vector $v' = \Pi_R\{v\}$ satisfies

$$v' \in \arg\min\{\mathrm{dist}(w, v) : w \in R\}$$

Projection applied to the basic algorithm (8.2) is defined by the recursion,

$$\theta_{n+1} = \Pi_R\{\theta_n + \alpha_{n+1} f_{n+1}(\theta_n)\} \tag{8.20}$$


There are two difficulties with this approach: one is the potential computational complexity in
updating the parameter estimate, and the other is complexity in an ODE analysis. How do we
know that the parameter sequence doesn’t become trapped on the boundary of $R$?

Re-start. Whenever $\theta_{n+1} \notin R$, we simply reset its value to a vector within the interior of $R$. For
example, with $R = \{\theta : \|\theta\| \le r\}$, you might choose the reset parameter to be $\theta_{n+1}^{+} \in R$ by scaling:

$$\theta_{n+1}^{+} = \frac{r}{2}\,\frac{\theta_{n+1}}{\|\theta_{n+1}\|}$$

This procedure is far simpler than projection, and analysis is simpler.
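Both alternatives are simple to code when the region is a Euclidean ball, $R = \{\theta : \|\theta\| \le r\}$. The sketch below is illustrative only: the function names are hypothetical, and the re-start rule resets to radius $r/2$ along the direction of the offending estimate, which is one assumed way to land in the interior of $R$.

```python
import math


def euclidean_norm(v):
    return math.sqrt(sum(x * x for x in v))


def project_ball(v, r):
    """Euclidean projection onto R = {theta : ||theta|| <= r}."""
    norm = euclidean_norm(v)
    if norm <= r:
        return list(v)
    return [r * x / norm for x in v]


def restart_ball(v, r, shrink=0.5):
    """If v lies outside R, reset it to the interior point (shrink * r) * v / ||v||."""
    norm = euclidean_norm(v)
    if norm <= r:
        return list(v)
    return [shrink * r * x / norm for x in v]


print(project_ball([3.0, 4.0], r=1.0))   # lands on the boundary of R
print(restart_ball([3.0, 4.0], r=1.0))   # lands strictly inside R
print(restart_ball([0.3, 0.1], r=1.0))   # already in R: unchanged
```

Either function would be applied to the tentative update $\theta_n + \alpha_{n+1} f_{n+1}(\theta_n)$ at each iteration; for a region that is not a ball, the projection step generally requires solving an optimization problem.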


8.1.3 Choice of step-size 


The next question is how to choose the step-size appearing in (8.2). A constant step-size is often 
preferred in applications, though choice of the constant value remains an art-form. Also, with a 
constant step-size we cannot expect convergence, unless the covariance matrix defined in (8.27) 
is identically zero. This degenerate special case is not found in the applications of interest in 
the remainder of this book, so we assume a vanishing step-size: $\lim_{n\to\infty} \alpha_n = 0$. Two further
assumptions are imposed. First, the discrete-time version of Assumption (QSA1) of Section 4.5.4:

$$\sum_{n=1}^{\infty} \alpha_n = \infty$$

This is imposed so that it is possible to reach $\theta^*$ from each initial $\theta_0$.
The second assumption is more strongly rooted in probability theory:

$$\sum_{n=1}^{\infty} \alpha_n^2 < \infty \tag{8.21}$$

In particular, it is used to establish (8.18), which is required for convergence of parameter estimates
in Thm. 8.1. The proposition that follows makes clear why we require (8.21). The uniform bound
on the second moment of $\{\Delta_k\}$ is relaxed in Prop. 8.7 (see (8.60b)).


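Proposition 8.2. Suppose that $\{\Delta_k\}$ is an uncorrelated sequence with bounded covariance:
$\bar\sigma_\Delta^2 = \sup_k E[\|\Delta_k\|^2] < \infty$. Then the second moment of $\{M_K^{(n)} : K > n\}$ appearing in (8.16) admits the
uniform bound

$$E[\|M_K^{(n)}\|^2] \le \bar\sigma_\Delta^2 \sum_{j=n+1}^{\infty} \alpha_j^2$$

The right hand side vanishes as $n \to \infty$, provided (8.21) holds.

Proof. If $\{\Delta_k\}$ is an uncorrelated sequence, then the covariance of $M_K^{(n)}$ satisfies

$$E[M_K^{(n)} (M_K^{(n)})^T] = \sum_{i=n+1}^{K} \alpha_i^2\, E[\Delta_i \Delta_i^T]$$

and consequently,

$$E[\|M_K^{(n)}\|^2] = \mathrm{trace}\, E[M_K^{(n)} (M_K^{(n)})^T] \le \bar\sigma_\Delta^2 \sum_{i=n+1}^{K} \alpha_i^2 \qquad □$$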


8.1.4 Multiple timescales 


There are many examples in which it is valuable to use different step-sizes for different parameter 
estimates. 
Consider the averaging technique (8.10). This is usually presented as a two time-scale recursion
where (with $N_0 = 0$),

$$\theta_{n+1}^{PR} = \theta_n^{PR} + \alpha_{n+1}\bigl[\theta_{n+1} - \theta_n^{PR}\bigr], \qquad n \ge 0$$

and with $\alpha_n = 1/n$. A similar recursion holds for arbitrary $N_0 < N$, but the recursion begins at
$n = N_0$ with $\theta_{N_0}^{PR} = 0$. The assumption (8.10c) is thus equivalently expressed

$$\lim_{n\to\infty} \frac{\beta_n}{\alpha_n} = \infty \tag{8.22}$$

This means that the estimates $\{\theta_n\}$ defined in (8.10a) evolve much more rapidly than $\{\theta_n^{PR}\}$.
The general two time-scale algorithm is described as follows:

$$\theta_{n+1} = \theta_n + \beta_{n+1} f_{n+1}(\theta_n, w_n) \tag{8.23a}$$

$$w_{n+1} = w_n + \alpha_{n+1} g_{n+1}(\theta_n, w_n) \tag{8.23b}$$

where $\{\theta_n\}$ evolves in $\mathbb{R}^d$ and $\{w_n\}$ evolves in $\mathbb{R}^m$. The assumption (8.22) is maintained, which is
what makes this a two time-scale recursion.

The ODE approximation for these recursions will not be covered in any generality here, but the 
main message is something every user of SA should understand. 

As always, we require the mean vector fields $\bar f(\theta, w) = E_\varpi[f_{n+1}(\theta, w)]$ and $\bar g(\theta, w) = E_\varpi[g_{n+1}(\theta, w)]$
(steady-state expectations). For each $w \in \mathbb{R}^m$, assume there is a unique vector $\theta^\circ(w)$ solving
$\bar f(\theta^\circ(w), w) = 0$. Because the evolution of the second parameter sequence is so slow compared to
the first, it is possible to prove under general conditions that $\theta_n \approx \theta^\circ(w_n)$ for all large $n$. This
leads to the following ODE approximation for the second recursion:

$$\tfrac{d}{dt} w_t = \bar g(\theta^\circ(w_t), w_t) \tag{8.24}$$

See Thm. 8.3 for an example of convergence theory based on these concepts.
This insight is invaluable in algorithm design: 


(i) In constrained optimization, where $w_n$ is an approximation of a dual variable (say, in the 
setting of Prop. 4.12). 


(ii) The approximation (8.24) is essential for creating recursive algorithms that approximate the 
Newton-Raphson flow, as we have already seen in the QSA setting of Section 4.5.6. 


(iii) The elegant theory of actor-critic methods is made possible through versions of the ODE 
approximation (8.24). 
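A small numerical sketch may help fix ideas for the two time-scale recursion (8.23). The model here is hypothetical and chosen only for transparency: $\bar f(\theta, w) = w - \theta$, so that $\theta^\circ(w) = w$, and $\bar g(\theta, w) = 1 - \theta$, so that the slow ODE (8.24) becomes $\tfrac{d}{dt} w = 1 - w$ with equilibrium $w^* = 1$. The step-sizes $\beta_n = 1/n^{0.6}$ and $\alpha_n = 1/n$ are assumed choices satisfying (8.22).

```python
import random


def two_time_scale(n_iter, seed=2):
    """Simulate (8.23) for an assumed scalar model with additive Gaussian noise."""
    rng = random.Random(seed)
    theta, w = 0.0, 0.0
    for n in range(n_iter):
        beta = 1.0 / (n + 1) ** 0.6      # fast step-size beta_{n+1}
        alpha = 1.0 / (n + 1)            # slow step-size alpha_{n+1}
        f_obs = (w - theta) + rng.gauss(0.0, 0.5)     # mean: f_bar = w - theta
        g_obs = (1.0 - theta) + rng.gauss(0.0, 0.5)   # mean: g_bar = 1 - theta
        # Both updates use the current pair (theta_n, w_n), as in (8.23a)-(8.23b).
        theta, w = theta + beta * f_obs, w + alpha * g_obs
    return theta, w


theta_final, w_final = two_time_scale(20_000)
print(theta_final, w_final)
```

Both estimates should settle near 1 in this example, with the fast variable $\theta_n$ tracking $w_n$ and showing larger fluctuations because of its larger step-size.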


Stability theory for two time-scale SA can be found in other sources, such as [67], where the 
proof of the following companion to Thm. 8.1 can be found: 


Theorem 8.3. Consider the two time-scale algorithm (8.23) subject to the following assumptions:
> $\bar f$ and $\bar g$ are Lipschitz continuous, and the function $\bar g(\theta^\circ(w), w)$ is also Lipschitz continuous
in $w$.
> The parameter sequences $\{\theta_n, w_n : n \ge 0\}$ are bounded a.s.
> The cumulative disturbances are bounded, in the sense that the following sums exist and are
everywhere finite with probability one:

$$\sum_{n=0}^{\infty} \beta_{n+1}\{f_{n+1}(\theta_n, w_n) - \bar f(\theta_n, w_n)\}, \qquad \sum_{n=0}^{\infty} \alpha_{n+1}\{g_{n+1}(\theta_n, w_n) - \bar g(\theta_n, w_n)\} \tag{8.25}$$

> The ODE (8.24) is globally asymptotically stable with unique equilibrium $w^*$, and for each
$w$ the following ODE is asymptotically stable with unique equilibrium $\theta^\circ(w)$:

$$\tfrac{d}{dt}\vartheta_t = \bar f(\vartheta_t, w)$$

Then, $\lim_{n\to\infty} \theta_n = \theta^* = \theta^\circ(w^*)$ and $\lim_{n\to\infty} w_n = w^*$ a.s. □


The remainder of this section contains much more on the important topic of step-size for the 
standard SA algorithm. 
8.1.5 Algorithm performance 


An emphasis in these latter chapters of the book is consideration of covariance of parameter estimates,
as defined in (8.5). The asymptotic covariance (8.6) exists under general conditions, and a
representation is possible based on two matrices: the linearization matrix¹⁸

$$A = A(\theta^*) = \partial_\theta \bar f(\theta^*) \tag{8.26}$$

and the steady-state disturbance covariance:

$$\Sigma_\Delta := \lim_{n\to\infty} \frac{1}{n} E[M_n M_n^T], \qquad M_n = \sum_{i=1}^{n} \Delta_i^\infty \tag{8.27}$$

where $\Delta_n^\infty = f_n(\theta^*)$. This is also known as the asymptotic covariance of $\{\Delta_n^\infty\}$, as it is the
covariance that appears in the Central Limit theorem for this sequence. If $\{\Delta_n^\infty\}$ is uncorrelated
then $\Sigma_\Delta = E_\varpi[\Delta_n^\infty (\Delta_n^\infty)^T]$.

The following conclusions are what can be expected, subject to additional conditions (such as
stability of the ODE (8.4a)):


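(i) For step-size $\alpha_n = g/(n + n_0)$, the asymptotic covariance (8.6) is finite provided each eigenvalue
of $A$ satisfies $\mathrm{Real}(\lambda(gA)) < -\tfrac12$, and in this case $\Sigma_\theta$ solves the Lyapunov equation

$$(gA + \tfrac12 I)\Sigma_\theta + \Sigma_\theta (gA + \tfrac12 I)^T + g^2 \Sigma_\Delta = 0 \tag{8.28a}$$

(ii) For $\alpha_n = g/(n + n_0)^\rho$ with $0.5 < \rho < 1$, the definition of the asymptotic covariance is
modified:

$$\Sigma_\theta = \lim_{n\to\infty} n^\rho \Sigma_n$$

This is finite provided $\mathrm{Real}(\lambda(A)) < 0$ ($A$ is Hurwitz), and in this case $\Sigma_\theta$ solves

$$A\Sigma_\theta + \Sigma_\theta A^T + g\Sigma_\Delta = 0 \tag{8.28b}$$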


There is of course no reason to restrict to a scalar step-size. With $\alpha_n = G/(n + n_0)$ it is
necessary to verify that the ODE $\tfrac{d}{dt}\vartheta = G\bar f(\vartheta)$ is globally asymptotically stable, and in this case
we typically conclude that the asymptotic covariance (8.6) is finite, provided each eigenvalue of $GA$
satisfies $\mathrm{Real}(\lambda(GA)) < -\tfrac12$. The covariance matrix is the solution to the Lyapunov equation

$$(GA + \tfrac12 I)\Sigma_\theta^G + \Sigma_\theta^G (GA + \tfrac12 I)^T + G\Sigma_\Delta G^T = 0 \tag{8.29}$$

The super-script $G$ is introduced here so we can identify the optimal choice:

Optimizing the asymptotic covariance. The Lyapunov equation (8.29) has a solution
$\Sigma_\theta^G \ge 0$ provided the eigenvalue test is satisfied: $\mathrm{Real}(\lambda) < -\tfrac12$ for each eigenvalue $\lambda$ of $GA$.
The choice $G^* = -A^{-1}$ passes this test, and results in

$$\Sigma_\theta^* = A^{-1}\Sigma_\Delta (A^{-1})^T \tag{8.30}$$

This is optimal, in the sense that the difference $\Sigma_\theta^G - \Sigma_\theta^*$ is positive semidefinite.


18 The linearization matrix $A(\theta^*)$ was denoted $A^*$ in the first half of the book, starting in Section 4.3. It is convenient 
to abandon the super-script here. 




See Prop. 8.10 for a special case, and the Notes section for resources. 
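In the scalar case the Lyapunov equation for step-size $\alpha_n = g/(n + n_0)$ can be solved by hand: with $A = a < 0$ and disturbance variance $\sigma_\Delta^2$, the asymptotic variance is $g^2\sigma_\Delta^2/(-2ga - 1)$ whenever $ga < -\tfrac12$. The sketch below evaluates this formula for a few gains; the numerical values of $a$ and $\sigma_\Delta^2$ are arbitrary assumptions, and the function name is hypothetical.

```python
def asymptotic_variance(g, a, sigma_delta2):
    """Scalar Lyapunov equation: 2 (g a + 1/2) S + g^2 sigma_delta2 = 0, solved for S."""
    if g * a >= -0.5:
        raise ValueError("eigenvalue test fails: need g * a < -1/2")
    return g * g * sigma_delta2 / (-2.0 * g * a - 1.0)


a, sigma_delta2 = -2.0, 1.0
g_star = -1.0 / a                       # scalar version of G* = -A^{-1}
optimal = sigma_delta2 / (a * a)        # scalar version of the optimal covariance

for g in (0.3, g_star, 1.0, 3.0):
    print(g, asymptotic_variance(g, a, sigma_delta2))
```

The gain $g^* = -1/a$ attains the smallest value among those tested, and the variance grows without bound as $ga$ approaches $-\tfrac12$ from below, which is the scalar counterpart of the eigenvalue test.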


The RL literature today has an entirely different emphasis: computation of a finite-time error
bound of the form

$$P\{\|\theta_n - \theta^*\| > \varepsilon\} \le \bar b \exp(-n \bar I(\varepsilon)), \qquad n \ge 1, \tag{8.31}$$

where $\bar b$ is a constant, and $\bar I(\varepsilon) > 0$ for $\varepsilon > 0$. The bound is usually inverted: for a given $\delta > 0$,
denote

$$\bar n(\varepsilon, \delta) = \frac{1}{\bar I(\varepsilon)}\bigl[\log(\bar b) + \log(\delta^{-1})\bigr] \tag{8.32}$$

Then, (8.31) implies the sample complexity bound: $P\{\|\theta_n - \theta^*\| > \varepsilon\} \le \delta$ for all $n \ge \bar n(\varepsilon, \delta)$.
The value of a finite-$n$ bound is indisputable: we have assurance that the probability of error
is below the desired value, provided we wait for $n \ge \bar n(\varepsilon, \delta)$ iterations of the algorithm.
There are also significant challenges:


1. A sample complexity bound is typically very conservative. If the bound is very loose, then
we will not be willing to wait for $n$ to exceed the over-estimate $\bar n$.


2. The bound $\bar n$ may not offer guidance on how to improve the algorithm.
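The inversion of the finite-time bound (8.31) into the sample complexity (8.32) is a one-line computation. In the sketch below the rate function $\bar I(\varepsilon) = \varepsilon^2/2$ and the constant $\bar b = 2$ are hypothetical values chosen for illustration; they do not come from the text.

```python
import math


def sample_complexity(eps, delta, b_bar, rate):
    """Smallest integer n with n >= [log(b_bar) + log(1/delta)] / rate(eps), as in (8.32)."""
    return math.ceil((math.log(b_bar) + math.log(1.0 / delta)) / rate(eps))


def quadratic_rate(eps):
    """Hypothetical rate function I(eps) = eps^2 / 2 (an assumption for illustration)."""
    return 0.5 * eps * eps


n_bar = sample_complexity(eps=0.1, delta=0.05, b_bar=2.0, rate=quadratic_rate)
bound_at_n_bar = 2.0 * math.exp(-n_bar * quadratic_rate(0.1))
print(n_bar, bound_at_n_bar)
```

Halving $\varepsilon$ in this example multiplies the required number of iterations by four, while the dependence on $\delta$ is only logarithmic.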
Counterparts for the parameter estimate covariance are not a challenge: 


1. $\sigma_\theta^2 = \mathrm{trace}(\Sigma_\theta)$ can be estimated using batch means methods based on a short run, and
this gives approximate confidence bounds for a longer run. This is a standard technique in
simulation (see (6.41b)).


2. We will see that the asymptotic covariance can serve as a guide to algorithm design. Gain
design for Watkins’ Q-learning algorithm introduced in Section 9.6 is one simple example,
where we conclude that $\alpha_n = (1 - \gamma)^{-1} n^{-1}$ leads to the optimal $O(n^{-1})$ MSE convergence
rate without any complex calculations.


There is a third criticism concerning any bound on the error metric $\|\tilde\theta_n\|$ (based on (8.31) 
or a bound on the error covariance): it is probably not what we care about! In the context 
of reinforcement learning, most valuable would be a measure of the performance of the policies 
associated with the parameter estimates, as in the theory of bandits surveyed in Section 7.8. 
Fortunately, a CLT for the parameter estimates will likely imply approximations for more relevant 
statistics, such as average cost. In this case we can borrow techniques from simulation theory to 
obtain approximate confidence bounds. 


8.1.6 Asymptotic vs transient performance 


As in the QSA theory of Section 4.9, there are tradeoffs in choice of step-size: while $\alpha_n = g/n$ will
typically yield the optimal convergence rate (8.8) (most crucially, $g > 0$ is chosen sufficiently large),
this may also result in poor transient behavior.

Please review the summaries in both Section 8.1.2 and Section 4.5.4 to recall that convergence
theory for SA and QSA rests on comparing $\Theta_t$ with what amounts to a time-scaling of the ODE
(8.4a) (recall (8.16) and (8.17)). Bounds on this time-scaling are easily obtained:


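Lemma 8.4. Consider the step-size $\alpha_n = 1/n^\rho$. The following bounds hold for each $n \ge 1$:

$$\tau_n^- \le \tau_n \le \tau_n^+$$

where

$$\tau_n^- = \begin{cases} \dfrac{1}{1-\rho}\bigl[(n+1)^{1-\rho} - 1\bigr] & \rho < 1 \\ \log(n+1) & \rho = 1 \end{cases}
\qquad
\tau_n^+ = \begin{cases} 1 + \dfrac{1}{1-\rho}\bigl[n^{1-\rho} - 1\bigr] & \rho < 1 \\ 1 + \log(n) & \rho = 1 \end{cases}$$

Proof. The following bounds hold for any $t > 0$,

$$\frac{1}{t+1} \le \frac{1}{\lfloor t+1 \rfloor} \le \frac{1}{\max(1, t)}$$

where $\lfloor t+1 \rfloor$ is the integer part of $t + 1$; in particular, $\lfloor t+1 \rfloor = 1$ for $0 \le t < 1$. The lemma follows
from these bounds, combined with the integral representation

$$\tau_n = \int_0^n \Bigl(\frac{1}{\lfloor t+1 \rfloor}\Bigr)^{\rho}\, dt \qquad □$$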
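The bounds used in the proof can be checked numerically. The sketch below compares $\tau_n = \sum_{k=1}^n k^{-\rho}$ with the two integrals obtained by integrating $(t+1)^{-\rho}$ and $\max(1,t)^{-\rho}$ over $[0, n]$; the closed forms in `lower` and `upper` are derived here from those integrals, and the function names are hypothetical.

```python
import math


def tau(n, rho):
    """tau_n = sum of the step-sizes 1/k^rho for k = 1, ..., n."""
    return sum(k ** -rho for k in range(1, n + 1))


def lower(n, rho):
    """Integral of (t + 1)^(-rho) over [0, n]."""
    if rho == 1.0:
        return math.log(n + 1)
    return ((n + 1) ** (1.0 - rho) - 1.0) / (1.0 - rho)


def upper(n, rho):
    """Integral of max(1, t)^(-rho) over [0, n]."""
    if rho == 1.0:
        return 1.0 + math.log(n)
    return 1.0 + (n ** (1.0 - rho) - 1.0) / (1.0 - rho)


for rho in (0.6, 0.85, 1.0):
    for n in (1, 10, 1000):
        print(rho, n, lower(n, rho), tau(n, rho), upper(n, rho))
```

The printout makes the contrast between the two regimes visible: for $\rho = 1$ the time $\tau_n$ grows only logarithmically in $n$, while for $\rho < 1$ it grows polynomially.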


Let’s revisit the uniform convergence obtained in (8.19) which motivated the lemma. Suppose
that the ODE converges exponentially quickly to the optimal parameter: there exists $\varrho_0 > 0$,
$B_0 < \infty$ such that for any solution to (8.4a), and any $t_0, t \ge 0$,

$$\|\vartheta_{t_0 + t} - \theta^*\| \le B_0 \|\vartheta_{t_0} - \theta^*\| \exp(-\varrho_0 t) \tag{8.33}$$

To apply (8.19) we set $t_0 = \tau_n$, so that $\vartheta_{\tau_n}^{(n)} = \theta_n$ by construction, and choose $t_0 + t = \tau_{n+k}$ for
$k \ge 1$. If $\alpha_n = g/n$ then (8.19) gives, for a large range of $k \ge 1$,

$$\begin{aligned}
\|\theta_{n+k} - \theta^*\| \approx \|\vartheta_{\tau_{n+k}}^{(n)} - \theta^*\| &\le B_0 \|\theta_n - \theta^*\| \exp\bigl(-\varrho_0 g[\log(n+k) - 1 - \log(n)]\bigr) \\
&= B_0 \|\theta_n - \theta^*\| \exp(\varrho_0 g) \Bigl(\frac{n+k}{n}\Bigr)^{-\varrho_0 g}
\end{aligned} \tag{8.34}$$

If $\varrho_0 g$ is not sufficiently large, then the convergence of $\vartheta_{\tau_{n+k}}^{(n)}$ to $\theta^*$ may be very slow. In this case it
doesn’t make much sense to worry about the variance of the parameter estimation error: it is the
slow transient behavior that we should worry about.
Lemma 8.4 tells us that choice $\rho < 1$ leads to much faster convergence:

$$\|\vartheta_{\tau_{n+k}}^{(n)} - \theta^*\| \le B_0 \|\theta_n - \theta^*\| \exp\bigl(\varrho_0 g(1 + \tau_n)\bigr) \exp\Bigl(-\varrho_0 g \frac{1}{1-\rho}(n + k + 1)^{1-\rho}\Bigr) \tag{8.35}$$

This is not geometrically fast, but the right hand side vanishes faster than $k^{-N}$ for any $N \ge 1$.
Does this mean we are forced to settle with a sub-optimal convergence rate in order to cope with
transients?


Fear not! So far we have learned that each design choice for $\alpha_n = g/n^\rho$ faces challenges:


(i) We choose $\rho = 1$ to obtain the optimal convergence rate for the mean-square error. However,
poor transient behavior might result in very slow convergence, unless $g > 0$ is sufficiently large
(which creates its own problems).


(ii) The “high gain” choice $\alpha_n = g/n^\rho$, with $\rho < 1$, results in the much better bound (8.35).
However, the mean-square error of the parameter estimates converges to zero at rate much
slower than $1/n$.
PJR averaging is a means to obtain the best features of each of these two cases. Motivation can be
found in Fig. 8.2, showing typical behavior of parameter estimates for a scalar SA algorithm, with three
choices of $\rho$. All three estimates converge to the common limiting value $\theta^* = 0$. We see that the estimates
show greater volatility for smaller values of $\rho$, as predicted by the theory. Note however that the estimates
all appear to fluctuate about $\theta^* = 0$. This motivates additional smoothing in (8.10b), where the starting
point $N_0$ is intended to be chosen “after the transients have settled down”.

Figure 8.2: Comparison of three values of $\rho$ for the step-size $\alpha_n = g/n^\rho$.


This simple smoothing of the estimates results in the optimal convergence rate under mild 
assumptions. The technique is illustrated with an example in Section 8.3, and analysis is postponed 
to Section 8.6.3. 


8.2 Examples 


The examples that follow are intended to make more concrete the themes outlined in the previous 
section. 


8.2.1 Monte-Carlo 


Suppose we wish to estimate the mean $\theta^* = E[c(\Phi)]$, where $c\colon \Omega \to \mathbb{R}^d$, and the random variable
$\Phi$ has density $p$. That is,

\[
\theta^* = \int c(x)\, p(x)\, dx
\]

Markov Chain Monte-Carlo (MCMC) techniques include methods to construct a Markov chain $\boldsymbol{\Phi}$
whose steady-state distribution has density equal to the target density $p$ [115, 12]. Computation of
the mean $\theta^*$ is then an SA problem:

\[
0 = \bar f(\theta^*) = E[f(\theta^*, \Phi)] = E[c(\Phi) - \theta^*]
\]

Consider the SA recursion (8.2) in which $\alpha_n = g/n$, with $g > 0$:

\[
\theta_{n+1} = \theta_n + \frac{g}{n+1}\bigl[c(\Phi(n+1)) - \theta_n\bigr], \qquad n \ge 0
\tag{8.36}
\]

This recursion and the associated ODE are linear:

\[
\tfrac{d}{dt}\vartheta_t = -g[\vartheta_t - \theta^*]
\]


This ODE is globally asymptotically stable for any g > 0. The case g = 1 is very special: 


Proposition 8.5. For the special case $g = 1$, the estimates obtained from (8.36) can be expressed
as the sample path average:

\[
\theta_n = \frac{1}{n}\sum_{k=1}^{n} c(\Phi(k))
\tag{8.37}
\]

This representation holds regardless of the initial condition $\theta_0$.

Proof. Multiplying each side of (8.36) by $(n+1)$ results in a recursive representation of the scaled
parameter sequence $\{S_k = k\theta_k : k \ge 0\}$:

\[
S_{n+1} = (n+1)\theta_n + [c(\Phi(n+1)) - \theta_n] = S_n + c(\Phi(n+1)), \qquad n \ge 0
\]

The representation of $\theta_n = S_n/n$ as a Monte-Carlo average (8.37) follows, since $S_0 = 0$. □
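
The cancellation of the initial condition in Prop. 8.5 is easy to check numerically. The following is a minimal sketch, not taken from the text: the function name `sa_monte_carlo` and the Gaussian samples standing in for $c(\Phi(k))$ are illustrative assumptions.

```python
import random

def sa_monte_carlo(samples, theta0, g=1.0):
    """Run recursion (8.36): theta_{n+1} = theta_n + g/(n+1) * (c_{n+1} - theta_n)."""
    theta = theta0
    for n, c in enumerate(samples):          # n = 0, 1, 2, ...
        theta = theta + (g / (n + 1)) * (c - theta)
    return theta

rng = random.Random(0)                        # fixed seed, for reproducibility only
# Hypothetical stand-in for the observations c(Phi(k)), k = 1, ..., 1000
samples = [rng.gauss(2.0, 1.0) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)

for theta0 in (0.0, 100.0, -7.5):
    estimate = sa_monte_carlo(samples, theta0, g=1.0)
    print(f"theta0 = {theta0:7.1f}: estimate = {estimate:.6f}, sample mean = {sample_mean:.6f}")
```

With $g = 1$ the first update replaces $\theta_0$ by the first observation, which is why every initial condition gives the same output. Any other gain leaves a residual dependence on $\theta_0$.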




Figure 8.3: Asymptotic covariance for Monte-Carlo estimates (8.36) for the scalar recursion. 


Optimizing the gain The recursion (8.36) converges to $\theta^*$ subject to mild conditions on the
Markov chain, and asymptotic statistics tells us something about the rate of convergence.
The covariance $\Sigma_n$ in (8.5) is a scalar in this case:

\[
\sigma_n^2 = \mathrm{trace}(\Sigma_n) = \Sigma_n = E[\tilde\theta_n^2]
\]

which typically admits the approximation

\[
\sigma_n^2 = \frac{1}{n}\sigma_\theta^2 + o(1/n)
\]

where the asymptotic variance $\sigma_\theta^2$ is the solution to (8.28a) with $A = -1$, and $\sigma_\Delta^2 = \Sigma_\Delta$ defined in
(8.27). The Lyapunov equation admits the explicit solution shown in Fig. 8.3 in this scalar setting.

However, this solution is valid only if $g > \tfrac12$. Assuming $\sigma_\Delta^2$ is non-zero, the asymptotic variance
is infinite for $g < \tfrac12$ and minimized using $g^* = 1$.
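
The dependence of the asymptotic variance on the gain can be tabulated directly. The sketch below assumes that the scalar Lyapunov equation with $A = -1$ and step-size $g/n$ has the solution $\sigma_\theta^2 = g^2\sigma_\Delta^2/(2g - 1)$ for $g > \tfrac12$. This expression is not printed in the text above, but it agrees with formula (8.47) when $g^* = 1$. The function name is hypothetical.

```python
import math

def asymptotic_variance(g, sigma_delta_sq=1.0):
    """Assumed scalar solution of the Lyapunov equation for A = -1 and step-size g/n."""
    if g <= 0.5:
        return math.inf                      # no finite solution at or below g = 1/2
    return g * g * sigma_delta_sq / (2.0 * g - 1.0)

# Grid search over gains in (0.5, 3.5) to locate the minimizer
gains = [0.5 + 0.01 * i for i in range(1, 300)]
best_gain = min(gains, key=asymptotic_variance)

print("best gain on the grid:", round(best_gain, 2))
for g in (0.6, 1.0, 2.0, 3.0):
    print(f"g = {g}: variance = {asymptotic_variance(g):.3f}")
```

The variance grows without bound as $g$ decreases to $\tfrac12$, and grows roughly linearly in $g$ for large gain, with the minimum at $g^* = 1$.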


8.2.2 Stochastic gradient descent 


Consider the minimization problem $\theta^* = \arg\min_\theta \Gamma(\theta)$, with $\Gamma(\theta) = E[\Gamma(\theta, \Phi)]$, $\theta \in \mathbb{R}^d$. The
gradient descent ODE is defined by

\[
\tfrac{d}{dt}\vartheta = -\nabla\Gamma(\vartheta)
\tag{8.38}
\]

An approximation of gradient descent can be realized as SA, with

\[
\bar f(\theta) = -\nabla_\theta E[\Gamma(\theta, \Phi)] = -E[\nabla_\theta \Gamma(\theta, \Phi)]
\]

The exchange of derivative and expectation can be justified under mild conditions.

Suppose we are given samples $\boldsymbol{\Phi} = \{\Phi_n : n \ge 1\}$ for which the distribution of $\Phi_n$ converges
to that of $\Phi$, as $n \to \infty$, and denote $\Gamma_n(\theta) = \Gamma(\theta, \Phi_n)$ for any $n$ and $\theta$. The basic algorithm (8.2)
results in stochastic gradient descent (SGD):

\[
\theta_{n+1} = \theta_n - \alpha_{n+1}\nabla_\theta \Gamma_{n+1}(\theta_n)
\tag{8.39}
\]

Stability of the algorithm Provided $\Gamma$ is strongly convex (recall (4.25)), Prop. 4.7 tells us
that the gradient flow (8.38) is exponentially asymptotically stable. In the proof it is shown that
$V(\theta) = \tfrac12\|\theta - \theta^*\|^2$ serves as a Lyapunov function:

\[
\tfrac{d}{dt}V(\vartheta_t) \le -2\delta_0 V(\vartheta_t) \implies \|\vartheta_t - \theta^*\| \le \|\vartheta_0 - \theta^*\| \exp(-\delta_0 t)
\tag{8.40}
\]

Consequently, the bound (8.33) holds with $B_0 = 1$ and $\varrho_0 = \delta_0$. Conditions for exponential
asymptotic stability without convexity can be obtained from Thm. 4.9.


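What about stability of the SA algorithm? A sufficient condition is based on the ODE@∞, just
as in the deterministic setting of Section 4.8.4: define for $\theta \in \mathbb{R}^d$,

\[
\Gamma^\infty(\theta) = \lim_{r\to\infty}\frac{1}{r^2}\Gamma(r\theta), \qquad \nabla\Gamma^\infty(\theta) = \lim_{r\to\infty}\frac{1}{r}\nabla\Gamma(r\theta)
\]

Assuming these functions are finite valued, it follows that the first is radially quadratic, and the
second is radially linear:

\[
\Gamma^\infty(s\theta) = s^2\Gamma^\infty(\theta), \qquad \nabla\Gamma^\infty(s\theta) = s\nabla\Gamma^\infty(\theta), \qquad s > 0
\]

If $\Gamma$ is strongly convex, then (4.25) implies the lower bounds

\[
\theta^T\nabla\Gamma^\infty(\theta) \ge \delta_0\|\theta\|^2, \qquad \Gamma^\infty(\theta) \ge \tfrac12\delta_0\|\theta\|^2
\]

The first bound follows directly from (4.25) and the definition of $\nabla\Gamma^\infty$. The second bound follows
from the first and the following:

Lemma 8.6. If $\Gamma^\infty$ is continuously differentiable, then $\Gamma^\infty(\theta) = \tfrac12\theta^T\nabla\Gamma^\infty(\theta)$, $\theta \in \mathbb{R}^d$.

Proof. Differentiability justifies the chain rule in the following:

\[
\Gamma^\infty(\theta) = \int_0^1 \frac{d}{dt}\Gamma^\infty(t\theta)\, dt = \int_0^1 \theta^T\nabla\Gamma^\infty(t\theta)\, dt = \theta^T\nabla\Gamma^\infty(\theta)\int_0^1 t\, dt
\]

The first equality is the fundamental theorem of calculus, and the identity $\Gamma^\infty(0) = 0$. The final
equality follows from radial linearity of the gradient. □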


The ODE@∞ is defined by $\tfrac{d}{dt}\vartheta^\infty_t = -g\nabla\Gamma^\infty(\vartheta^\infty_t)$. Convergence of the stochastic recursion
follows from stability of two ODEs:

(i) If $\nabla\Gamma^\infty$ is continuous, and $\Gamma^\infty(\theta) > 0$ for $\theta \ne 0$, then the ODE@∞ is globally asymptotically
stable. Theory in Section 8.6.1 implies that the sequence of parameter estimates from the SA
algorithm (8.39) is bounded:

\[
\sup_n \|\theta_n\| < \infty \quad \text{a.s.}
\]

(ii) If the assumptions of (i) hold, and in addition (8.38) is globally asymptotically stable, then
the estimates are convergent to the unique root $\nabla\Gamma(\theta^*) = 0$.

Asymptotic covariance The asymptotic covariance defined in (8.6), and each of its variants, can
be expressed in terms of the linearization matrix $A = \partial_\theta \bar f$. For this class of algorithms, $A$ is a
symmetric matrix with entries

\[
A_{i,j}(\theta) = \frac{\partial}{\partial\theta_j}\bar f_i(\theta) = -\frac{\partial^2}{\partial\theta_i\partial\theta_j}\Gamma(\theta)
\tag{8.41}
\]

Variance theory requires that $A(\theta^*)$ is Hurwitz (recall (8.28a) and (8.28b)), which in this case
means that the Hessian of $\Gamma$ is positive definite at the optimizer: $-A(\theta^*) = \nabla^2\Gamma(\theta^*) > 0$.


Tension between asymptotic and transient performance Let’s first consider a setting where
the tension is not so severe: $\Gamma$ is a quadratic, with $\Gamma(\theta) = \tfrac12(\theta - \theta^*)^T G(\theta - \theta^*)$, and $G > 0$ (positive
definite). This is strongly convex, with $\delta_0 = \lambda_1$, the minimum eigenvalue of $G$.

Consider the step-size $\alpha_n = g/(n + n_0)$:


> Transient bounds: The right hand side of the bound (8.34) will decay as $1/n$ when using the
step-size $\alpha_n = g/n$ and $g = 1/\lambda_1$.

> Asymptotic covariance: The linearization matrix used to compute the asymptotic covariance
is $A(\theta^*) = -G$. The gain $g = 1/\lambda_1$ will result in a finite asymptotic covariance since all
eigenvalues of $gA$ satisfy $\lambda(gA) \le -1 < -\tfrac12$.

> ODE fidelity for small n: We may require $n_0 \gg g$ to avoid massive transients for small $n$
(or employ re-start to keep the parameters within a pre-defined region $R$, as described at the
close of Section 8.1.2).


Theory surrounding PJR averaging motivates the use of $\alpha_n = g/n^\rho$ with a smaller value of $g$
(say, g = @ if a good choice is available), and $\rho < 1$. This approach requires far less “tinkering”
with parameters. It is true that we are left with three: $\rho$, $g$, and the integer $N_0$ used to obtain the
final estimate $\theta_N^{PR}$ via (8.10b). Theory and practice suggest that sensitivity to $\rho < 0.9$ and $g > 0$
is not so high, and the choice of $N_0$ should be clear after observing sample paths from various large
initial conditions.


Or, we might give up on SA entirely. An alternative to SGD is considered next. 


8.2.3 Empirical Risk Minimization


Recall from Section 5.1 our brief survey of empirical risk minimization (ERM). This approach can
be applied here, with the empirical risk equal to the empirical mean:

\[
\Gamma_N(\theta) = \frac{1}{N}\sum_{i=1}^{N}\Gamma_i(\theta)
\tag{8.42}
\]

Application of ERM means that we forego a recursive SA algorithm, and instead minimize $\Gamma_N(\theta)$
to obtain the estimate $\theta_N^{ERM}$. This approach has become popular in recent years for optimization
in very high dimensions.

For sake of illustration, consider the special case of quadratic loss $\Gamma_i(\theta) = \tfrac12\|M_i\theta - \xi_i\|^2$, in
which the matrix $M_i$ and vector $\xi_i$ are random. We then have $\nabla\Gamma_i(\theta) = M_i^T(M_i\theta - \xi_i)$, so the SA
algorithm is defined by

\[
\theta_{n+1} = \theta_n - \alpha_{n+1}M_{n+1}^T\{M_{n+1}\theta_n - \xi_{n+1}\}
\tag{8.43}
\]

The ERM solution can be obtained in closed form, by applying the first order necessary condition
for optimality $\nabla\Gamma_N(\theta) = 0$, where

\[
\nabla\Gamma_N(\theta) = \frac{1}{N}\sum_{i=1}^{N} M_i^T\{M_i\theta - \xi_i\}
\]

Consequently, provided the inverse exists,

\[
\theta_N^{ERM} = \Bigl(\sum_{i=1}^{N} M_i^T M_i\Bigr)^{-1}\sum_{i=1}^{N} M_i^T\xi_i
\tag{8.44}
\]

Prop. 8.8 below shows that $\theta_N^{ERM}$ coincides exactly with a particular SA algorithm using a
carefully constructed matrix gain:

\[
\theta_{n+1} = \theta_n - \alpha_{n+1}G_{n+1}M_{n+1}^T\{M_{n+1}\theta_n - \xi_{n+1}\}
\tag{8.45}
\]

in which

\[
G_{n+1}^{-1} = \frac{1}{n+1}\sum_{i=1}^{n+1} M_i^T M_i, \qquad n \ge 0
\]

It will follow from Prop. 8.8 that $\theta_N^{ERM}$ is an estimate of $\theta^*$ with minimum covariance.

Outside of some special cases (such as very high dimension), ERM should be regarded as a last
resort. ERM and Zap each have approximately the same covariance as the PJR estimate $\theta_N^{PR}$ defined
in (8.10b). Given the simplicity of the recursions defining $\theta_N^{PR}$, this is probably the first algorithm
to try in most applications.
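
The equivalence between the closed-form solution (8.44) and the matrix-gain recursion (8.45) can be checked on synthetic data. This is a sketch under stated assumptions: the step-size is taken to be $\alpha_{n+1} = 1/(n+1)$ as in Prop. 8.8, each $M_i$ has at least as many rows as columns so that the first inverse exists, and the dimensions, noise level, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed for reproducibility
d, m, N = 3, 4, 200                       # hypothetical sizes: each M_i is m x d with m >= d
theta_star = np.array([1.0, -2.0, 0.5])

M = rng.standard_normal((N, m, d))
xi = M @ theta_star + 0.1 * rng.standard_normal((N, m))   # noisy linear observations

# Closed-form ERM solution (8.44)
S = sum(M[i].T @ M[i] for i in range(N))
r = sum(M[i].T @ xi[i] for i in range(N))
theta_erm = np.linalg.solve(S, r)

# Matrix-gain recursion (8.45), assuming alpha_{n+1} = 1/(n+1)
theta = np.array([10.0, 10.0, 10.0])      # arbitrary initial condition
running_sum = np.zeros((d, d))            # equals (n+1) * inverse of G_{n+1} after each update
for n in range(N):
    Mn, xin = M[n], xi[n]
    running_sum += Mn.T @ Mn
    # alpha_{n+1} * G_{n+1} is the inverse of the running sum of M_i^T M_i
    theta = theta - np.linalg.solve(running_sum, Mn.T @ (Mn @ theta - xin))

print("max difference between recursion and ERM:", np.max(np.abs(theta - theta_erm)))
```

The recursion is a form of recursive least squares. The linear solve is used in place of an explicit inverse, which is a numerical convenience rather than part of the algorithm's definition.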


8.3 Algorithm Design Example 


The purpose of this section is to illustrate algorithm design with a simple nonlinear example. We 
will eventually arrive at a very efficient algorithm for this test case based on PJR averaging. 
The function appearing in the SA recursion (8.2) is defined by $f_{n+1}(\theta) = f(\theta, \Phi(n+1))$ with

\[
f(\theta, \Phi) = -(\theta + 3\sin(\theta)) + 10\cos(\theta)\Phi
\]

where $\theta$ and $\Phi$ are scalar, so that $f\colon \mathbb{R}^2 \to \mathbb{R}$. With $\Phi$ zero mean we have

\[
\bar f(\theta) = E[f(\theta, \Phi)] = -(\theta + 3\sin(\theta))
\]

The plot of $\bar f$ in Fig. 8.4 shows that $\theta^* = 0$ is the unique root.

We can integrate this function to obtain the representation $\bar f = -\nabla\Gamma$. The function $\Gamma(\theta) =
\tfrac12\theta^2 - 3\cos(\theta)$ is also shown in Fig. 8.4, where we see that it is quasi-convex and coercive. The SA
algorithm is an instance of stochastic gradient descent to obtain the minimizer of $\Gamma$.


8.3.1 Gain selection 


For the choice of step-size $\alpha_n = g/n$, the gain minimizing the asymptotic variance is $g = -1/A(\theta^*)$
in this scalar example. We have $A(\theta) = \partial_\theta \bar f(\theta) = -(1 + 3\cos(\theta))$, giving $A(\theta^*) = -4$ and $g^* = 1/4$.
When considering transient performance of the SA algorithm, we will see that this is a terrible
choice.

The transient bound (8.34) is based on global exponential asymptotic stability in the form
(8.33), which is obtained through Lyapunov theory: with $V(\theta) = \tfrac12\theta^2$,

\[
\tfrac{d}{dt}V(\vartheta_t) = -\vartheta_t(\vartheta_t + 3\sin(\vartheta_t))
\]

The right hand side is negative whenever $\vartheta_t \ne 0$. To obtain a bound in terms of $V$, consider the
worst case:

\[
\min_\theta \frac{1}{V(\theta)}[\theta(\theta + 3\sin(\theta))] = 2\min_\theta \frac{1}{\theta^2}[\theta(\theta + 3\sin(\theta))] = 2 + 6\bigl[\min_\theta \mathrm{sinc}(\theta)\bigr] \ge 0.68
\]

where the final bound comes from $\mathrm{sinc}(\theta) \ge -0.22$ for all $\theta$. Denoting $\varrho_0 = 0.68/2 = 0.34$, we
obtain the Lyapunov drift condition $\tfrac{d}{dt}V(\vartheta_t) \le -2\varrho_0 V(\vartheta_t)$, leading to $V(\vartheta_t) \le V(\vartheta_0)\exp(-2\varrho_0 t)$,
and by definition of $V$,

\[
|\vartheta_t - \theta^*| \le |\vartheta_0 - \theta^*| \exp(-\varrho_0 t)
\tag{8.46}
\]

Hence (8.33) holds with $B_0 = 1$. Fig. 8.4 indicates that $\varrho_0 = 0.34$ is approximately equal to the
largest value such that $\bar f(\theta) \ge -0.34\theta$ for $\theta \le 0$, and $\bar f(\theta) \le -0.34\theta$ for $\theta \ge 0$.
The bound (8.34) becomes

\[
|\vartheta_{\tau_{n+k}} - \theta^*| \le |\theta_n - \theta^*| \exp(0.34 g)\Bigl(\frac{n+k}{n}\Bigr)^{-0.34 g}
\]

The right hand side converges to zero at rate $1/k$ with $g = 1/0.34 \approx 3$.
The value $g = 1/4$ may lead to very slow transient performance, even if the mean-square error converges to
zero at its optimal rate. This is illustrated in Fig. 8.5.

Before discussing the figure we revisit the topic of asymptotic variance.

Figure 8.4: $\bar f = -\nabla\Gamma$.
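
The constant $\varrho_0 = 0.34$ rests on the lower bound for the sinc function, which can be confirmed with a simple grid search. The sketch below is illustrative only: it uses the unnormalized convention $\mathrm{sinc}(\theta) = \sin(\theta)/\theta$ that appears in the derivation, and the grid range and spacing are arbitrary choices.

```python
import math

def sinc(theta):
    """Unnormalized sinc: sin(theta)/theta, with sinc(0) = 1."""
    return 1.0 if theta == 0.0 else math.sin(theta) / theta

# Grid on [0, 50]: sinc is even, and |sinc(theta)| <= 1/|theta| beyond the grid
grid = [i * 1e-3 for i in range(50001)]
min_sinc = min(sinc(t) for t in grid)

drift_ratio = 2.0 + 6.0 * min_sinc        # minimum of theta*(theta + 3 sin(theta)) / V(theta)
rate = drift_ratio / 2.0                  # sharper counterpart of the value 0.34 used in the text
print(f"min sinc = {min_sinc:.4f}, drift ratio = {drift_ratio:.4f}, rate = {rate:.4f}")
```

The search places the minimum of sinc slightly above $-0.22$, so the value $0.34$ is a mildly conservative rounding of the rate obtained from this Lyapunov function.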


8.3.2 Variance formulae 


What can we expect if we increase the gain in the step-size rule $\alpha_n = g/n$? A variance formula for
$g > g^*/2$ follows from (8.28a), and appears similar to the variance formula for Monte-Carlo:

\[
\sigma_\theta^2 = \frac12\Bigl(\frac{g^2}{g/g^* - \tfrac12}\Bigr)\sigma_\Delta^2
\tag{8.47}
\]

The value $g = 1/0.34$ is nearly 12 times larger than the value $g^* = 1/4$ derived using asymptotic
theory. The formula (8.47) can be applied to establish that $\sigma_\theta^2$ is increased by more than 6-fold
with the larger gain.

If we instead opt for $\alpha_n = g/n^\rho$, with $\rho < 1$, then the definition (8.6) results in

\[
\sigma_\theta^2 = \lim_{n\to\infty} n^\rho E[|\theta_n - \theta^*|^2]
\tag{8.48}
\]

The formula is simpler in this case, and is valid for any $g > 0$, and any $\rho \in (0.5, 1)$:

\[
\sigma_\theta^2 = \tfrac12 g g^* \sigma_\Delta^2
\tag{8.49}
\]

See Section 8.6.2 for details.

8.3.3 Simulations


Up until now we have said nothing about the disturbance $\Phi$, except that it has zero mean. The
plots that follow used an i.i.d. Gaussian process with $\Phi_n \sim N(0,10)$. For i.i.d. disturbance, the
variance term appearing on the right hand side of (8.47) is defined by

\[
\sigma_\Delta^2 = E[\{f(\theta^*, \Phi_n)\}^2] = E[\{10\cos(\theta^*)\Phi_n\}^2] = 100
\]

With step-size $\alpha_n = g/n$, the formula (8.47) gives

\[
\sigma_\theta^2 = \frac12\Bigl(\frac{g^2}{g/g^* - \tfrac12}\Bigr)\sigma_\Delta^2
\begin{cases}
= (5/2)^2 & g = g^* = 1/4 \\
\approx 6 \times (5/2)^2 & g = 1/0.34
\end{cases}
\]

[Figure 8.5 panels: columns $\rho = 1.0$, $\rho = 0.9$, $\rho = 0.8$, with $\tau_N = 2.8$, $5.3$, $11$ respectively; horizontal axis $\tau_k$ in the first row; histograms of $\sqrt{N}\theta_N$ and $\sqrt{N^\rho}\theta_N$ with Gaussian overlay in the remaining rows.]

Figure 8.5: The first row shows a comparison of the parameter estimates from SA, and the sequence obtained from
the Euler approximation (8.50), with common initial condition $\theta_0 = 100$. The second row shows that the CLT holds
for $\rho = 1$, and the theoretical variance predicts what is observed in simulations. The third row illustrates the CLT
for all three values of $\rho$, using the required scaling of the parameter error by $\sqrt{N^\rho}$.
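
The numerical values quoted from formulas (8.47) and (8.49) can be reproduced with a few lines of arithmetic. This is a minimal sketch using $g^* = 1/4$ and $\sigma_\Delta^2 = 100$ from the text; the function names are hypothetical.

```python
import math

def variance_rho_one(g, g_star, sigma_delta_sq):
    """Formula (8.47): asymptotic variance for step-size g/n, valid for g > g_star/2."""
    if g <= g_star / 2.0:
        raise ValueError("formula (8.47) requires g > g_star / 2")
    return 0.5 * g ** 2 / (g / g_star - 0.5) * sigma_delta_sq

def variance_rho_below_one(g, g_star, sigma_delta_sq):
    """Formula (8.49): asymptotic variance for step-size g/n^rho with 0.5 < rho < 1."""
    return 0.5 * g * g_star * sigma_delta_sq

g_star, sigma_delta_sq = 0.25, 100.0
for g in (g_star, 1.0 / 0.34):
    v1 = variance_rho_one(g, g_star, sigma_delta_sq)
    v2 = variance_rho_below_one(g, g_star, sigma_delta_sq)
    print(f"g = {g:.3f}: (8.47) gives {v1:.2f} (std {math.sqrt(v1):.2f}), "
          f"(8.49) gives {v2:.2f} (std {math.sqrt(v2):.2f})")
```

For the larger gain the two formulas give standard deviations of roughly 6.2 and 6.1, and the ratio of the variances from (8.47) at the two gains is a little above six.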


Transient and asymptotic behavior Fig. 8.5 shows results from SA using $\alpha_n = g^*/n^\rho$ for
three values of $\rho$. The columns are distinguished by the value of $\rho$. The plots shown in the first
row compare the output of the algorithm $\Theta_{\tau_n} = \theta_n$ with the deterministic Euler approximation:

\[
\vartheta_{\tau_{n+1}} = \vartheta_{\tau_n} + \alpha_{n+1}\bar f(\vartheta_{\tau_n}), \qquad n \ge 0, \qquad \vartheta_0 = \theta_0
\tag{8.50}
\]

The approximation $\Theta_{\tau_n} \approx \vartheta_{\tau_n}$ appears to be very accurate. The high accuracy is due in part to the
large initial condition, since the fluctuations of $\theta_n$ are small compared to its magnitude (the plots


Pre-publication draft -- March 25, 2022 


CHAPTER 8. STOCHASTIC APPROXIMATION 297 


shown in Fig. 8.2 are for the same example, but with $\theta_0 = 1$). These plots are shown on the “ODE
time scale”: the x-axis is $\tau_k$ rather than $k$, so that the final time $\tau_N$ depends on the step-size:

\[
\tau_N = \sum_{k=1}^{N}\alpha_k
\]

The value $N = 10^5$ was chosen in this experiment, so that $\alpha_k = g^*/k^\rho$ results in the values of $\tau_N$
shown in the top row of Fig. 8.5. We see that $\vartheta_{\tau_n}$ remains far from the equilibrium value $\theta^* = 0$
when using $\rho = 1$.


[Figure 8.6 panels: histograms of $\sqrt{N}\theta_N$ and $\sqrt{N^\rho}\theta_N$, each with a Gaussian overlay.]


Figure 8.6: Histograms for the scaled parameter estimates using g = 1/0.34. 


The histograms shown in Fig. 8.5 were obtained based on 500 independent runs, time horizon
$N = 10^5$, and with initial condition sampled independently $\theta_0 \sim N(0,1)$ (so that the transient
behavior has less impact). The normalized error at the end of the run appears approximately
Gaussian in all cases.

For the larger gain $g = 1/0.34$ the situation is very different: $|\vartheta_{\tau_n}|$ is very close to zero for
$\tau_n \ge 3$, and for any value of $\rho \le 1$. Fig. 8.6 shows histograms obtained using this larger gain. We
observed earlier that (8.47) results in $\sigma_\theta \approx 6.2$ ($\rho = 1$). The formula (8.49) yields $\sigma_\theta \approx 6$ ($\rho < 1$).


PJR averaging We finally illustrate the value of averaging. The parameter estimates used to
create the previous histograms were averaged according to the formula (8.10b), with $(N - N_0)/N =
0.4$. Fig. 8.7 shows histograms of the normalized parameter estimation error for these smoothed
estimates. As in the prior experiments, the columns are distinguished by $\rho$, and the two rows are
distinguished by choice of gain $g$ in the step-size rule $\alpha_n = g/n^\rho$.

The results are not so sensitive to choice of gain. Theory in Section 8.6.3 tells us that Polyak-
Ruppert averaging will result in the optimal convergence rate for the MSE, provided the step-size
$\alpha_n = g/n^\rho$ is used with $\rho \in (0.5, 1)$ and $g > 0$.
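
A small experiment along these lines can be sketched as follows. Several details are assumptions rather than facts from the text: the averaging formula (8.10b) is taken to be the plain average of the final 40% of the estimates, the disturbance is taken to be standard Gaussian (which matches the stated value $\sigma_\Delta^2 = 100$), and the run length and number of runs are far smaller than in the reported experiments. The function names are hypothetical.

```python
import math
import numpy as np

def run_sa(g, rho, n_steps, theta0, rng):
    """Scalar SA recursion for the example of Section 8.3, step-size alpha_n = g / n**rho."""
    noise = rng.standard_normal(n_steps).tolist()   # assumption: unit-variance disturbance
    theta = float(theta0)
    path = [theta]
    for n, phi in enumerate(noise, start=1):
        f = -(theta + 3.0 * math.sin(theta)) + 10.0 * math.cos(theta) * phi
        theta += (g / n ** rho) * f
        path.append(theta)
    return np.array(path)

def pjr_average(path, frac=0.4):
    """Average of the final fraction `frac` of the estimates (assumed form of (8.10b))."""
    n_steps = len(path) - 1
    n0 = n_steps - int(round(frac * n_steps))
    return float(path[n0 + 1:].mean())

rng = np.random.default_rng(1)
n_steps, rho, n_runs = 10_000, 0.8, 20
results = {}
for g in (0.25, 1.0 / 0.34):
    finals, averaged = [], []
    for _ in range(n_runs):
        path = run_sa(g, rho, n_steps, theta0=rng.standard_normal(), rng=rng)
        finals.append(path[-1])
        averaged.append(pjr_average(path))
    results[g] = (float(np.std(finals)), float(np.std(averaged)))
    print(f"g = {g:.2f}: std(final) = {results[g][0]:.3f}, std(PJR) = {results[g][1]:.3f}")
```

The quantity of interest is the spread of the averaged estimates across runs, compared with the spread of the final un-averaged estimates. With the larger gain the un-averaged estimates are much noisier, while the averaged estimates should be far less affected by the choice of $g$.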


8.4 Zap Stochastic Approximation 


Figure 8.7: Histograms obtained using Polyak-Ruppert averaging, for two different choices of $g$.

Given the amazing properties of the Polyak-Ruppert averaging technique, why would we ever
consider a matrix gain algorithm? Three reasons can be found from a review of the book up to
now:


(i) The most compelling motivation is Prop. 4.4, which suggests we obtain a consistent algorithm
under very mild assumptions on $\bar f$.


(ii) The transient behavior is ideal: recall from (4.12) we obtain $\bar f(\vartheta_t) = \bar f(\vartheta_0)e^{-t}$, which suggests
good numerical properties for an SA algorithm.


(iii) What could not be predicted in the first part of this book is that the asymptotic covariance
of this algorithm is also optimal (under mild assumptions on $\bar f$ and $\{f_n\}$).


8.4.1 Approximating the Newton-Raphson Flow 


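This algorithm is designed so that it approximates the regularized Newton-Raphson flow introduced
in (4.15):

\[
\tfrac{d}{dt}\vartheta_t = -[\varepsilon I + A(\vartheta_t)^T A(\vartheta_t)]^{-1}A(\vartheta_t)^T \bar f(\vartheta_t)
\]


Zap Stochastic Approximation


Initialize $\theta_0 \in \mathbb{R}^d$, $\hat A_0 \in \mathbb{R}^{d\times d}$, $\varepsilon > 0$. Update for $n \ge 0$:

\[
\hat A_{n+1} = \hat A_n + \beta_{n+1}[A_{n+1} - \hat A_n], \qquad A_{n+1} = \partial_\theta f_{n+1}(\theta_n)
\tag{8.51a}
\]

\[
\theta_{n+1} = \theta_n + \alpha_{n+1}G_{n+1}f_{n+1}(\theta_n), \qquad G_{n+1} \overset{\text{def}}{=} -[\varepsilon I + \hat A_{n+1}^T\hat A_{n+1}]^{-1}\hat A_{n+1}^T
\tag{8.51b}
\]

The two step-size sequences $\{\alpha_n\}$ and $\{\beta_n\}$ satisfy (8.22).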
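
The two updates of the Zap SA algorithm translate almost line by line into code. The following is a sketch only: the function signature, the noise-free linear test problem, and the particular step-size sequences are illustrative assumptions, chosen so that the outcome is easy to predict.

```python
import numpy as np

def zap_sa(f, jac, theta0, A0, alpha, beta, eps, n_iter, sample):
    """Zap SA (8.51): matrix gain built from a running estimate of the Jacobian."""
    theta = np.array(theta0, dtype=float)
    A_hat = np.array(A0, dtype=float)
    d = theta.size
    for n in range(n_iter):
        phi = sample(n + 1)                                   # Phi(n+1)
        A_next = jac(theta, phi)                              # A_{n+1}, evaluated at theta_n
        A_hat = A_hat + beta(n + 1) * (A_next - A_hat)        # (8.51a)
        # G_{n+1} f_{n+1}(theta_n), via a linear solve rather than an explicit inverse
        direction = -np.linalg.solve(eps * np.eye(d) + A_hat.T @ A_hat,
                                     A_hat.T @ f(theta, phi))
        theta = theta + alpha(n + 1) * direction              # (8.51b)
    return theta, A_hat

# Hypothetical noise-free linear problem f(theta) = A theta - b, used only as a sanity check
A = np.array([[-2.0, 1.0], [0.0, -0.5]])
b = np.array([1.0, -1.0])
theta_root = np.linalg.solve(A, b)

theta, A_hat = zap_sa(
    f=lambda th, phi: A @ th - b,
    jac=lambda th, phi: A,
    theta0=[5.0, -3.0], A0=np.eye(2),
    alpha=lambda n: 1.0 / n, beta=lambda n: 1.0 / n ** 0.85,
    eps=1e-6, n_iter=50, sample=lambda n: None)

print("estimate:", theta, " root:", theta_root)
```

In this deterministic test the matrix estimate is exact after the first update, so the iteration behaves like a damped Newton-Raphson method and reaches the root up to an error of order $\varepsilon$.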


Under the “high gain” assumption (8.22) we expect that $\hat A_{n+1}$ is a reasonable estimate of $A(\theta_n)$
for large $n$, with $A(\theta_n)$ defined in (8.13). This intuition is supported by the general theory of two
time-scale SA, summarized briefly in Section 8.1.4. In particular, the high-gain assumption for
(8.51a) tells us that we can expect the approximation $\hat A_n \approx A(\theta_n)$ after a transient period. The
ODE approximation for $\{\theta_n\}$ is precisely the regularized Newton-Raphson flow.

8.4.2 Zap Zero


In most of the algorithms considered in later chapters, the matrix $A_{n+1}$ appearing in (8.51a) is of
low rank. For the special case $\varepsilon = 0$, the matrix update for $G_{n+1}$ is efficiently computed using the
Matrix Inversion Lemma (A.1). This is particularly simple when the rank of $A_{n+1}$ is equal to one,
and defined in factored form $A_{n+1} = W_{n+1}V_{n+1}^T$, with $W_{n+1}$ and $V_{n+1}$ column vectors. In this case,
(A.1) gives the update equation

\[
G_{n+1} = \frac{1}{1-\beta_{n+1}}\Bigl[G_n + \frac{\beta_{n+1}\, G_n W_{n+1} V_{n+1}^T G_n}{(1-\beta_{n+1}) - \beta_{n+1} V_{n+1}^T G_n W_{n+1}}\Bigr]
\]

The right hand side involves the two matrix-vector products, $G_n W_{n+1}$ and $V_{n+1}^T G_n$, which introduces
complexity of order $O(d^2)$.
An update of complexity $O(d)$ (and even $O(1)$) is possible under stronger conditions.
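
A rank-one update of this kind can be checked against direct matrix inversion. The sketch below assumes $\varepsilon = 0$, so that $G_n = -\hat A_n^{-1}$, and applies the Sherman-Morrison form of the Matrix Inversion Lemma to $\hat A_{n+1} = (1-\beta)\hat A_n + \beta W V^T$. The function name and the random test data are hypothetical.

```python
import numpy as np

def rank_one_gain_update(G, W, V, beta):
    """Update G_n = -inv(A_hat_n) when A_hat_{n+1} = (1 - beta) * A_hat_n + beta * W V^T."""
    GW = G @ W                              # matrix-vector product, O(d^2)
    VG = V @ G                              # vector-matrix product, O(d^2)
    denom = (1.0 - beta) - beta * (V @ GW)  # scalar; must be non-zero for invertibility
    return (G + beta * np.outer(GW, VG) / denom) / (1.0 - beta)

rng = np.random.default_rng(0)
d, beta = 5, 0.1
A_hat = -np.eye(d) + 0.1 * rng.standard_normal((d, d))   # hypothetical current estimate
W, V = rng.standard_normal(d), rng.standard_normal(d)

G_fast = rank_one_gain_update(-np.linalg.inv(A_hat), W, V, beta)
G_direct = -np.linalg.inv((1.0 - beta) * A_hat + beta * np.outer(W, V))

print("max discrepancy:", np.max(np.abs(G_fast - G_direct)))
```

The fast update avoids the $O(d^3)$ cost of a fresh inversion at every iteration, which is the point of using the lemma.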


Zap Zero Stochastic Approximation

Initialize $\theta_0, w_0 \in \mathbb{R}^d$. Update for $n \ge 0$:

\[
\theta_{n+1} = \theta_n + \alpha_{n+1}w_n
\tag{8.52a}
\]

\[
w_{n+1} = w_n + \beta_{n+1}\{A_{n+1}w_n + f_{n+1}(\theta_n)\}, \qquad A_{n+1} = \partial_\theta f_{n+1}(\theta_{n+1})
\tag{8.52b}
\]

The two step-size sequences satisfy (8.22).


The multiplication $A_{n+1}w_n$ introduces additional complexity of order at most $d$. In applications
we often find that $A_n$ is sparse. For example, in Watkins’ algorithm (9.75) the matrix $A_n$ is rank
one, and also has at most two non-zero entries for each $n$. This is when we say that the additional
complexity is zero or $O(1)$.

The fast time-scale recursion (8.52b) admits an ODE approximation with $\theta_n$ frozen:

\[
\tfrac{d}{dt}w_t = A(\theta)w_t + \bar f(\theta)
\]

Provided $A(\theta)$ is Hurwitz, the ODE is globally asymptotically stable, with equilibrium $\bar w(\theta) =
-A(\theta)^{-1}\bar f(\theta)$. Theory of two time-scale SA predicts that $w_n \approx \bar w(\theta_n)$, and hence (8.52a) is
approximated by an Euler approximation of the Newton-Raphson flow: for a vanishing vector-valued
sequence $\{\varepsilon_n\}$,

\[
\theta_{n+1} = \theta_n - \alpha_{n+1}[A(\theta_n)^{-1}\bar f(\theta_n) + \varepsilon_{n+1}]
\]
Proposition 8.7. Consider the algorithm (8.52) under the following assumptions:

> $\bar f$ is Lipschitz continuous, and $A = \partial_\theta \bar f$ is a bounded and Lipschitz continuous function of $\theta$.

> The parameter sequences $\{\theta_n, w_n : n \ge 0\}$ are bounded a.s.

> The cumulative disturbance (8.15) vanishes in the uniform sense assumed in Thm. 8.1.

> $A(\theta)$ is Hurwitz for each $\theta$.

Then, $\lim_{n\to\infty} \bar f(\theta_n) = 0$. □

8.4.3 Stochastic Newton Raphson


The algorithm (8.51) with $\beta_n = \alpha_n$ for each $n$, and $\varepsilon = 0$, is called Stochastic Newton Raphson
(SNR). For the special case $\alpha_n = \beta_n = 1/n$, and $\varepsilon = 0$, the matrix sequence defined in (8.51a)
reduces to the average of $\{A_n\}$. This uniform averaging is not desirable, and in particular hinders
stability analysis. It is unlikely that this SNR ODE enjoys the same universal stability properties
as the Newton-Raphson flow.


SNR for linear SA Consider the linear setting:

\[
f_{n+1}(\theta) = A_{n+1}\theta - b_{n+1}, \qquad \bar f(\theta) = A\theta - b
\tag{8.53a}
\]

The basic SA algorithm is defined by

\[
\theta_{n+1} = \theta_n + \alpha_{n+1}[A_{n+1}\theta_n - b_{n+1}]
\tag{8.53b}
\]

This is a very special case because $A(\theta) = E[A_n]$ does not depend upon $\theta$. It is the only situation
for which relaxation of the assumption (8.22) can be easily justified.

The proposition that follows considers SNR using $\alpha_n = \beta_n = 1/n$, so that in particular (8.51a)
becomes

\[
\hat A_{n+1} = \hat A_n + \frac{1}{n+1}[A_{n+1} - \hat A_n]
\]

Recall from Section 8.2.1 that the solution to this recursion can be expressed as an average:

\[
\hat A_{n+1} = \frac{1}{n+1}\sum_{k=1}^{n+1} A_k
\]

Proposition 8.8. Consider the SNR algorithm for the linear model (8.53a) in the form:

\[
\theta_{n+1} = \theta_n - \frac{1}{n+1}\hat A_{n+1}^{-1}[A_{n+1}\theta_n - b_{n+1}], \qquad
\hat A_{n+1} = \frac{1}{n+1}\Bigl\{\hat A_0 + \sum_{k=1}^{n+1} A_k\Bigr\}
\tag{8.54}
\]

Suppose that the $d \times d$ matrix $\hat A_0$ is chosen so that $\hat A_n$ is invertible for each $n$. Then for each $n \ge 1$,
regardless of the initial condition $\theta_0$,

\[
\theta_n = \frac{1}{n}\hat A_n^{-1}\sum_{k=1}^{n} b_k
\]

An example is the minimization of a quadratic loss function with noisy observations, in the
setting of Section 8.2.3. This is the linear SA algorithm (8.53b) with $A_n = -M_n^T M_n$, $b_n = -M_n^T\xi_n$,
and $A(\theta) = -E[M_n^T M_n] < 0$ by assumption. We also have $M_n^T M_n \ge 0$ for all $n$, so any initialization
$\hat A_0 < 0$ will satisfy the assumption of the proposition. Prop. 8.8 implies that SNR is essentially
equivalent to ERM.

For the more general (non-quadratic) optimization problem (8.42) we cannot claim that $\theta_N^{ERM}$
is identical to the solution obtained using SNR (we do not even know if stochastic gradient descent
using SNR will be consistent). Algorithms can be obtained using Zap SA or PJR averaging to
obtain the same asymptotic covariance as ERM.


Proof of Prop. 8.8. Multiply each side of the recursion for $\{\theta_n\}$ in (8.54) by $(n+1)\hat A_{n+1}$ to obtain

\[
(n+1)\hat A_{n+1}\theta_{n+1} = (n+1)\hat A_{n+1}\theta_n - [A_{n+1}\theta_n - b_{n+1}]
\]

By definition $(n+1)\hat A_{n+1} = n\hat A_n + A_{n+1}$, giving

\[
(n+1)\hat A_{n+1}\theta_{n+1} = [n\hat A_n + A_{n+1}]\theta_n - [A_{n+1}\theta_n - b_{n+1}] = n\hat A_n\theta_n + b_{n+1}
\]

Iterating this recursion gives $(n+1)\hat A_{n+1}\theta_{n+1} = \sum_{k=1}^{n+1} b_k$ for each $n \ge 0$. □


8.5 Buyer Beware 


Stochastic approximation can be difficult to apply because of exotic nonlinear dynamics. This 
is often resolved using Zap SA, while PJR averaging might fail. Two other potential curses are 
described here: condition number, and disturbances with long memory. 


8.5.1 Curse of condition number 


The condition number of the linearization matrix $A = \partial_\theta \bar f(\theta^*)$ is the ratio of maximal and minimal
singular values:

\[
\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}
\tag{8.55}
\]

where the singular values $\{\sigma_i(A)\}$ are obtained by taking the square-root of each of the eigenvalues
of $AA^T$. If the condition number is large then the value of Polyak-Ruppert averaging may not be
observed until after a very long runlength.

The example and discussion here are an early warning of what might go wrong in reinforcement
learning. The standard Q-learning algorithm of Watkins typically has a linearization matrix $A$ with
massive condition number, and examples in Section 9.6 reveal that PJR averaging may provide no
benefit for time horizons less than 10 million.

This point is illustrated using the linear SA recursion

\[
\theta_{n+1} = \theta_n + \alpha_{n+1}\{A\theta_n + \Delta_{n+1}\}
\tag{8.56}
\]

in which $\Delta$ is Gaussian and i.i.d., with marginal $N(0, I)$. The matrix is taken to be a simple form:

\[
A = -[I + (\kappa - 1)vv^T]
\]

in which $\|v\| = 1$ and $\kappa > 1$. The matrix $-A$ has one eigenvalue at $\kappa$ (with eigenvector $v$), and the
remaining eigenvalues are unity. The condition number of $A$ is $\kappa$.
Let’s now compare two algorithms with optimal asymptotic covariance:


(i) Stochastic Newton-Raphson:

\[
\theta^{\text{Zap}}_{n+1} = \theta^{\text{Zap}}_n + \frac{1}{n+1}\{-\theta^{\text{Zap}}_n - A^{-1}\Delta_{n+1}\}
\tag{8.57}
\]

(ii) PJR estimates with carefully selected step-size. One challenge with a large condition number
is that the step-size must be reduced to avoid initial explosion of the parameter estimates.
This is required even if there is no noise, so that $f_n = \bar f$. This motivates the modification

\[
\alpha_n = \min(\alpha_0, n^{-\rho})
\tag{8.58}
\]

The value $\alpha_0 = 1/\kappa$ was reliable in all experiments, with $\rho \in (1/2, 1)$.


Either algorithm achieves the optimal asymptotic covariance,

\[
\Sigma^* = G\Sigma_\Delta G^T = (A^2)^{-1} = [I + r vv^T]^{-1} = I - \frac{r}{1+r}vv^T
\]

with $r = \kappa^2 - 1$. In particular, either approach results in the optimal asymptotic variance for each
component of the parameter:

\[
\lim_{n\to\infty} nE[\theta^{\text{Zap}}_n(i)^2] = \lim_{n\to\infty} nE[\theta^{PR}_n(i)^2] = \Sigma^*(i,i)
\]

In fact, no limit is required for the Zap SA recursion. This is seen through a refinement of
Prop. 8.8: first note that when $n = 0$ in (8.57), the initial condition is cancelled:

\[
\theta^{\text{Zap}}_1 = \theta^{\text{Zap}}_0 + \{-\theta^{\text{Zap}}_0 - A^{-1}\Delta_1\} = -A^{-1}\Delta_1
\]

It can be shown by induction that for each $n \ge 1$,

\[
\theta^{\text{Zap}}_n = -A^{-1}\frac{1}{n}\sum_{k=1}^{n}\Delta_k
\]

Hence $\sqrt{n}\,\theta^{\text{Zap}}_n \sim N(0, \Sigma^*)$ for each $n$!
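
The algebra in this example can be confirmed numerically: the condition number of $A$, the rank-one formula for the inverse of $A^2$, and the closed-form expression for the recursion (8.57). The following is a minimal sketch in which the dimension, the value of $\kappa$, the run length, and the random draws are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, kappa, n_steps = 50, 100.0, 200

z = rng.standard_normal(d)
v = z / np.linalg.norm(z)                          # unit vector
A = -(np.eye(d) + (kappa - 1.0) * np.outer(v, v))
A_inv = np.linalg.inv(A)

# Rank-one formula for the optimal covariance, with r = kappa^2 - 1
r = kappa ** 2 - 1.0
sigma_star = np.eye(d) - (r / (1.0 + r)) * np.outer(v, v)

# Recursion (8.57) driven by i.i.d. N(0, I) noise, from a non-zero initial condition
delta = rng.standard_normal((n_steps, d))
theta = 5.0 * rng.standard_normal(d)
for n in range(n_steps):
    theta = theta + (-theta - A_inv @ delta[n]) / (n + 1)

closed_form = -A_inv @ delta.mean(axis=0)

print("condition number:", np.linalg.cond(A))
print("max covariance discrepancy:", np.max(np.abs(sigma_star - A_inv @ A_inv)))
print("max recursion discrepancy:", np.max(np.abs(theta - closed_form)))
```

The recursion forgets its initial condition after a single step, so the final estimate depends only on the sample mean of the noise.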
[Figure 8.8 panels: left, $\theta_n(1)$, $\theta_n(2)$, $\theta_n(3)$; right, the corresponding Zap and PJR estimates; horizontal axis from 0 to 100.]


Figure 8.8: Comparison of first three parameter estimates: the left hand side shows parameter estimates using (8.56)
with step-size $\alpha_n = \min(1/\kappa, 1/n)$. The right hand side compares Zap SA and PJR-averaging.


In the experimental results described here, the vector $v$ was chosen as $v = z/\|z\|$, with $z$
selected $N(0, I)$ with $d = 50$. It was found that $\Sigma^*(i,i)$ was close to unity for each $i$ for the values of
$\kappa$ considered.

The exponent for Polyak-Ruppert averaging was taken to be $\rho = 0.75$, with averaging over the
recent 70% of samples:

\[
\theta^{PR}_n = \frac{1}{0.7n}\sum_{0.3n < k \le n}\theta_k
\]

Figure 8.9: Histograms of first three parameter estimates. For the Zap SA recursion, the estimates are Gaussian
with distribution $N(0, \Sigma^*(i,i))$ for each $i = 1, 2, 3$.


Histograms were obtained for $\{Z^i = \sqrt{n}\,\theta^i_n : 1 \le i \le 500\}$ (500 independent runs), with initialization
$\theta_0 \sim N(0, \sigma^2 I)$ using $\sigma^2 = 25$.

For $\kappa = 100$ the behavior of the Zap and Polyak-Ruppert estimates is very similar after just $10^3$
iterations. Increasing the value to $\kappa = 500$ revealed difficulties. The source of the problem is clear
from Fig. 8.8: The bound $\alpha_n \le 1/\kappa$ was imposed to avoid explosion of the parameter estimates.
The figure shows only the first three parameter estimates for each of the three parameter estimate
sequences. In these closeups, the estimates $\theta_n(1)$ and $\theta^{PR}_n(1)$ are not even visible since $\theta_0(1)$ is out
of range and the step-size is so small.

Fig. 8.9, showing histograms at $n = 10^3$, reveals significantly high variance when using Polyak-
Ruppert averaging. With $n = 10^4$ the situation is improved, and the results from PJR averaging
are more similar to SNR.

However, $n = 10^4$ samples is no longer sufficient if the condition number $\kappa$ is increased from
500 to 1000. The empirical variance of each parameter is found to be approximately 14 for $n = 10^4$
when using PJR averaging, while the optimal variance is slightly less than one.


8.5.2 Curse of Markovian memory 


The preceding example is special because the “noise” is i.i.d., rather than the Markovian noise that
is typical in RL. To illustrate how memory can present challenges, we consider once more a linear
algorithm, but this includes “multiplicative noise”:

\[
\theta_{n+1} = \theta_n + \alpha_{n+1}[A_{n+1}\theta_n + \Delta_{n+1}], \qquad \theta_0 \in \mathbb{R}
\tag{8.59}
\]

in which $A_{n+1} = X(n+1) - \eta - 1$, where $X$ is a sample of the M/M/1 queue (transition matrix
given in (6.14)), and $\Delta$ is i.i.d. $N(0,1)$ and independent of $X$. Recall from Section 6.7.4 that the
Markov chain $X$ has many desirable properties, provided $\mu > \alpha$ (the load condition for the single
server queue). In particular, it is reversible and geometrically ergodic, with geometric invariant
pmf and $\eta \overset{\text{def}}{=} E_\pi[X(n)] = \alpha/(\mu - \alpha)$.

The ODE approximation for (8.59) has linear vector field $\bar f(\theta) = -\theta$, so we expect convergence
to $\theta^* = 0$ from each initial condition. Moreover, given that the estimates converge to zero, we
might attempt to obtain a CLT and moment bounds through the following representation:

\[
\theta_{n+1} = \theta_n + \alpha_{n+1}[-\theta_n + \widetilde\Delta_{n+1}], \qquad \theta_0 \in \mathbb{R}
\]

where $\widetilde\Delta_{n+1} = \Delta_{n+1} + (X(n+1) - \eta)\theta_n$. Since $\theta_n$ converges to zero, this suggests we might attempt
to treat this as a scalar version of the linear SA recursion (8.56) with $A = -1$, for which the CLT
holds with asymptotic variance equal to one.


[Figure 8.10 panels: histograms of the normalized error for Polyak-Ruppert averaging and for Zap, at each of the two loads.]


Figure 8.10: Histograms of the normalized error for the scalar SA recursion with Markovian multiplicative disturbance.
The time horizon was $10^4$ for the smaller load, and $10^5$ for load 6/7.


However, the difficulties discussed in Section 6.7.4 concerning estimation of the steady-state
mean have a parallel here. If the load satisfies $\rho = \alpha/\mu > 1/2$ then $E[\theta_n^2] \to \infty$ as $n \to \infty$ using
$\alpha_n = 1/n$, which corresponds to Zap SA for this simple example (see the Notes for references). It
is likely the second moment is also unbounded for estimates obtained using PJR-averaging.

Fig. 8.10 shows some positive news: histograms of the normalized error $Z_n = \sqrt{n}\,\theta_n$ appear to be
nearly Gaussian $N(0,1)$ even for high load. What is missing from these plots is outliers that were
removed before plotting the histograms. For load $\alpha/\mu = 3/7$ the outliers were few and not large.
For load $\alpha/\mu = 6/7$, nearly 1/3 of the samples were labeled as outliers for both PJR-averaging and
Zap, and in this case the outliers were massive: values exceeding 107? were observed in about 1/5 
of runs. 


8.6 Some Theory* 


This chapter is technical, but intentionally incomplete. The main challenge is space—theory of 
stochastic approximation takes a sizable book of its own, as prior books will attest. The details of 
stability and convergence theory are skipped because these results follow the same steps as in the 
theory of Quasi-Stochastic Approximation from Section 4.5. 

The purpose of this section is to highlight the parallels between QSA and SA theory, and point 
out some differences (especially for analysis of PJR averaging, which takes up most of the space in 
this section). 

To streamline discussion we will impose strong assumptions. 


(SA1) $\bar f$ is Lipschitz continuous.
(SA2) The sequence $\{\Delta_n : n \ge 1\}$ is a martingale difference sequence:

\[
E[\Delta_{n+1} \mid \mathcal{F}_n] = 0, \qquad n \ge 0
\tag{8.60a}
\]

with $\mathcal{F}_n = \sigma\{\Phi(k) : k \le n\}$. Moreover, for some $\bar\sigma_\Delta^2 < \infty$,

\[
E[\|\Delta_{n+1}\|^2 \mid \mathcal{F}_n] \le \bar\sigma_\Delta^2(1 + \|\theta_n\|^2), \qquad n \ge 0
\tag{8.60b}
\]

(SA3) The sequence $\{\alpha_n : n \ge 1\}$ is deterministic, satisfies $0 < \alpha_n \le 1$, and

\[
\sum_{n=1}^{\infty}\alpha_n = \infty, \qquad \sum_{n=1}^{\infty}\alpha_n^2 < \infty
\tag{8.61}
\]


See the Notes section to find sources for much stronger conclusions under weaker assumptions. 
Much of this theory is very recent. 

In particular, Assumption (SA2) is imposed here for simplicity: it is not essential for the 
convergence theory for SA [39, 204, 65, 67]. This condition holds when ® is i.i.d., and also for 
TD-learning and Q-learning in special cases (see for example Prop. 9.16). 


8.6.1 Stability and convergence 


If you have read Section 4.9 then you know that convergence of SA is not difficult to establish, once
you have determined that the estimates $\{\theta_n : n \ge 0\}$ are bounded.

Conditions for boundedness follow the stability theory for QSA based on a Lyapunov function,
or via the ODE@∞ recalled here:

\[
\tfrac{d}{dt}\vartheta_t = \bar f_\infty(\vartheta_t)
\tag{8.62}
\]

with $\bar f_\infty$ defined in (4.138): $\bar f_\infty(\theta) = \lim_{r\to\infty}\bar f_r(\theta) = \lim_{r\to\infty} r^{-1}\bar f(r\theta)$ for any $\theta \in \mathbb{R}^d$.

The first row of Fig. 8.5 might be viewed as illustration of the approximation of SA by the
solution to the ODE $\tfrac{d}{dt}\vartheta = \bar f_r(\vartheta)$, with $r\vartheta_0 = \theta_0 = 100$.

The Lyapunov criterion is (QSV1): for a differentiable Lyapunov function, this is equivalently
expressed

$$ \tfrac{d}{dt}V(\vartheta_t) \le -\delta_0\|\vartheta_t\|, \qquad \text{whenever } \|\vartheta_t\| > c_0 $$

Theorem 8.9. Suppose that (SA1)-(SA3) hold, along with one of the following two conditions:

(i) (QSV1) holds with $V$ Lipschitz continuous, or

(ii) The origin is asymptotically stable for the ODE@∞.

Then the SA recursion is ultimately bounded in the following sense: there exists $\bar\sigma_\theta < \infty$ such that
for any initial condition,

$$ \limsup_{n\to\infty} \|\theta_n\| \le \bar\sigma_\theta \ \ \text{a.s.} \qquad\text{and}\qquad \limsup_{n\to\infty} \mathsf{E}[\|\theta_n\|^2] \le \bar\sigma_\theta^2 \tag{8.63} $$


The proof of the almost sure bound in (8.63) is identical to the proofs obtained for QSA. The 
mean-square error bound requires more work. 
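The scaling that defines the ODE@∞ can be checked numerically in one dimension. This is a minimal sketch under an assumed vector field: the function `f_bar` below (linear drift plus a bounded perturbation) is hypothetical, and its limit `f_bar_inf` was computed by hand for this example only.

```python
import math


def f_bar(theta):
    """Hypothetical mean vector field: linear drift plus a bounded, Lipschitz perturbation."""
    return -theta + 3.0 * math.tanh(theta) + 1.0


def f_bar_r(theta, r):
    """Scaled vector field r^{-1} * f_bar(r * theta), as in the definition of the ODE@infinity."""
    return f_bar(r * theta) / r


def f_bar_inf(theta):
    """Limit of f_bar_r as r -> infinity for this particular f_bar."""
    return -theta


# The bounded perturbation is at most 4 in magnitude, so the gap is at most 4 / r
gaps = [abs(f_bar_r(th, 1e6) - f_bar_inf(th)) for th in (-2.0, 0.5, 3.0)]
```

Here the ODE@∞ is $\tfrac{d}{dt}\vartheta_t = -\vartheta_t$, whose origin is asymptotically stable, so condition (ii) of Thm. 8.9 would apply to any SA recursion with this mean vector field.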


8.6.2 Linearization and convergence rates 


Section 8.6.3 contains a complete analysis of the asymptotic covariance when using PJR averaging,
which begins with the approximation:

$$ f_{n+1}(\theta_n) = A(\theta_n - \theta^*) + \Delta_{n+1} + \mathcal{E}_T(\theta_n) \tag{8.64} $$

where

$$ \mathcal{E}_T(\theta) = \bar f(\theta) - A[\theta - \theta^*] \tag{8.65} $$

is the error in a first-order Taylor series approximation for $\bar f$.

The difficult work involves mean-square bounds on the two terms $[\Delta_{n+1} - \Delta^*_{n+1}]$ and $\mathcal{E}_T(\theta_n)$.
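To expose the most interesting concepts we consider a special instance of the linear SA recursion
(8.53a) for which $\mathcal{E}_T$ is eliminated. Consider the simplest setting:

$$ \theta_{n+1} = \theta_n + \alpha_{n+1}\{A(\theta_n - \theta^*) + \Delta_{n+1}\} \tag{8.66} $$

with $A$ Hurwitz, and $\tilde\theta_n = \theta_n - \theta^*$. For analysis, it is convenient to express the linear recursion in
terms of the error:

$$ \tilde\theta_{n+1} = \tilde\theta_n + \alpha_{n+1}\{A\tilde\theta_n + \Delta_{n+1}\} \tag{8.67} $$

The scaled error is denoted

$$ Z_n = \frac{1}{\sqrt{\alpha_n}}\tilde\theta_n \qquad\text{and}\qquad \Sigma^Z_n \overset{\text{def}}{=} \mathsf{E}[Z_n Z_n^T] = \frac{1}{\alpha_n}\mathsf{E}[\tilde\theta_n\tilde\theta_n^T] \tag{8.68} $$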


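Proposition 8.10. Suppose that $\{\Delta_n\}$ satisfies (SA2), and that the covariance $\Sigma_\Delta$ is independent of
$n$. Then, the following approximations hold:

(i) With $\alpha_n = 1/(n+n_0)$,

$$ \Sigma^Z_{n+1} = \Sigma^Z_n + \frac{1}{n+n_0}\Big\{ (A + \tfrac12 I + E_n)\Sigma^Z_n + \Sigma^Z_n(A + \tfrac12 I + E_n)^T + \Sigma_\Delta \Big\} \tag{8.69a} $$

with $\|E_n\| = O(1/n)$. If in addition the matrix $A + \tfrac12 I$ is Hurwitz, then $\Sigma^Z_n$ converges to $\Sigma_\theta$,
and this limit is the solution to the Lyapunov equation (8.28a).

(ii) With $\alpha_n = 1/(n+n_0)^\rho$, using $\tfrac12 < \rho < 1$,

$$ \Sigma^Z_{n+1} = \Sigma^Z_n + \frac{1}{(n+n_0)^\rho}\Big\{ (A + E_n)\Sigma^Z_n + \Sigma^Z_n(A + E_n)^T + \Sigma_\Delta \Big\} \tag{8.69b} $$

with $\|E_n\| = O(1/n)$. If in addition the matrix $A$ is Hurwitz, then $\Sigma^Z_n$ converges to $\Sigma_\theta$, and
this limit is the solution to the Lyapunov equation (8.28b).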


Proof. In both parts we simplify the calculations by setting $n_0 = 0$. This is justified since with
$\rho \in (\tfrac12, 1]$ and $n_0 > 0$ fixed, we have for any $n \ge 1$,

$$ \frac{1}{(n+n_0)^\rho} = \frac{1}{n^\rho} + O\Big(\frac{1}{n^{1+\rho}}\Big) $$

(i) The approximation (8.69a) begins with the Taylor-series approximation

$$ \sqrt{n+1} = \sqrt{n} + \frac{1}{2\sqrt{n}} + O(n^{-3/2}) $$

Multiplying both sides of the recursion (8.67) by $\sqrt{n+1}$ results in a recursion for $Z$:

$$ Z_{n+1} = Z_n + \frac{1}{n+1}\Big\{ (A + \tfrac12 I + \varepsilon_n I) Z_n + \sqrt{n+1}\,\Delta_{n+1} \Big\} \tag{8.70} $$

with $\varepsilon_n = O(n^{-1})$. Taking outer-products on both sides of the recursion, and then taking
expectations, we obtain the recursion (8.69a) after simplifications. It is in this step that we use the fact
that $\Delta$ is a martingale difference sequence, so in particular

$$ \mathsf{E}[Z_n \Delta_{n+1}^T] = \mathsf{E}[\Delta_{n+1} Z_n^T] = 0 $$

The recursion (8.69a) can be viewed as a linear stochastic approximation algorithm of the form
(8.53b), subject to multiplicative noise that is $O(1/n)$. Convergence follows under the Hurwitz
assumption on $A + \tfrac12 I$.

The proof of (ii) is identical, following the Taylor series approximation

$$ (n+1)^{\rho/2} = n^{\rho/2} + O(n^{-1})\, n^{\rho/2} $$

Multiplying both sides of the recursion (8.67) by $(n+1)^{\rho/2}$ results in a recursion for $Z$:

$$ Z_{n+1} = (1 + \varepsilon_n) Z_n + \frac{1}{(n+1)^\rho}\Big\{ A(1 + \varepsilon_n) Z_n + (n+1)^{\rho/2}\Delta_{n+1} \Big\} \tag{8.71} $$

where $\varepsilon_n = O(n^{-1})$. The conclusions of (ii) follow. □
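In the scalar case the covariance recursion can be iterated exactly, which gives a quick sanity check on both parts of Prop. 8.10. The sketch below is illustrative only: the function `scaled_variance` and the values $a = -2$, $\sigma^2_\Delta = 1$ are assumptions, and the limits quoted in the comments are the scalar fixed points obtained by setting the bracketed terms in (8.69a) and (8.69b) to zero with $E_n = 0$.

```python
def scaled_variance(a, rho, n_iter, sigma2=1.0):
    """Exact value of E[err_n^2] / alpha_n for the scalar version of (8.67).

    err_{n+1} = (1 + a * alpha_{n+1}) * err_n + alpha_{n+1} * Delta_{n+1},
    with alpha_n = n^{-rho}. The cross term vanishes because Delta is a
    martingale difference sequence with variance sigma2.
    """
    v = 1.0          # E[err_0^2], arbitrary
    alpha = 1.0
    for n in range(1, n_iter + 1):
        alpha = n ** (-rho)
        v = (1.0 + a * alpha) ** 2 * v + alpha ** 2 * sigma2
    return v / alpha


# Part (i), rho = 1: fixed point of 2 (a + 1/2) Sigma + sigma2 = 0, i.e. Sigma = 1/3 when a = -2
limit_i = scaled_variance(a=-2.0, rho=1.0, n_iter=100_000)
# Part (ii), rho = 0.7: fixed point of 2 a Sigma + sigma2 = 0, i.e. Sigma = 1/4 when a = -2
limit_ii = scaled_variance(a=-2.0, rho=0.7, n_iter=100_000)
```

Convergence in part (ii) is noticeably slower, consistent with the larger error term that appears when $\rho < 1$.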


8.6.3 Polyak-Ruppert averaging 


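We conclude with an approximation of the error covariance obtained using PJR averaging, denoted

$$ \Sigma^{PR}_N \overset{\text{def}}{=} \mathsf{E}\big[\{\theta^{PR}_N - \theta^*\}\{\theta^{PR}_N - \theta^*\}^T\big] $$

The goal is to justify the approximation $N\Sigma^{PR}_N \approx \Sigma^*_\theta$, where $\Sigma^*_\theta = A^{-1}\Sigma_\Delta (A^{-1})^T$ denotes the optimal
asymptotic covariance matrix.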
The following two simple lemmas are our starting point. 


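Lemma 8.11. The Polyak-Ruppert estimate can be expressed

$$ \theta^{PR}_N - \theta^* = -\frac{1}{N-N_0} A^{-1}\big[ M_N - S^\tau_N + S^T_N + S^\Delta_N \big] \tag{8.72} $$

where $A = A(\theta^*)$,

$$ M_N = \sum_{n=N_0+1}^N f_{n+1}(\theta^*) = \sum_{n=N_0+1}^N \Delta^*_{n+1} \tag{8.73a} $$

$$ S^\tau_N = \sum_{n=N_0+1}^N f_{n+1}(\theta^\bullet_n) \tag{8.73b} $$

$$ S^T_N = \sum_{n=N_0+1}^N \mathcal{E}_T(\theta^\bullet_n) \tag{8.73c} $$

$$ S^\Delta_N = \sum_{n=N_0+1}^N \{\Delta_{n+1} - \Delta^*_{n+1}\} \tag{8.73d} $$

with $\mathcal{E}_T(\theta)$ defined in (8.65), and as in (8.3b),

$$ \Delta_{n+1} = f_{n+1}(\theta^\bullet_n) - \bar f(\theta^\bullet_n) \tag{8.74} $$

Proof. By the definitions

$$ f_{n+1}(\theta^\bullet_n) = \bar f(\theta^\bullet_n) + \Delta_{n+1} = A[\theta^\bullet_n - \theta^*] + \Delta^*_{n+1} + \mathcal{E}_T(\theta^\bullet_n) + \{\Delta_{n+1} - \Delta^*_{n+1}\} $$

The representation follows on multiplying each side of the equation by $A^{-1}$, summing over $n$, and
rearranging terms. □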
Under general conditions the first term (8.73a) dominates, and from this we obtain

$$ (N - N_0)\Sigma^{PR}_N = \frac{1}{N-N_0} A^{-1}\mathrm{Cov}(M_N)(A^{-1})^T + o(1) = \Sigma^*_\theta + o(1) \tag{8.75} $$

It isn’t difficult to see that two of the other terms should be relatively small:
(i) $\mathcal{E}_T(\theta^\bullet_n) = O(\|\theta^\bullet_n - \theta^*\|^2)$ when $\bar f$ is smooth.
(ii) We can only expect $\Delta_{n+1} - \Delta^*_{n+1} = O(\|\theta^\bullet_n - \theta^*\|)$ (without the square). It is the sum $S^\Delta_N$
that is small (compared to $M_N$) under (SA2).
It is a simple miracle that $\{S^\tau_N\}$ is small relative to $\{M_N\}$, given their very similar definitions.

Bounds on $\{S^\tau_N\}$ are made possible by another representation, obtained by first rearranging the
terms in (8.10a):

$$ f_{n+1}(\theta^\bullet_n) = \frac{1}{\beta_{n+1}}\big[\tilde\theta^\bullet_{n+1} - \tilde\theta^\bullet_n\big] \tag{8.76} $$

with $\tilde\theta^\bullet_n = \theta^\bullet_n - \theta^*$. A useful representation for the sum is obtained using this simple transformation:


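Lemma 8.12. (Summation by Parts) For any two real-valued sequences $\{x_n, y_n : n \ge 0\}$ and
integers $0 \le N_0 < N$,

$$ \sum_{n=N_0+1}^N x_n(y_n - y_{n-1}) = x_{N+1}y_N - x_{N_0+1}y_{N_0} - \sum_{n=N_0+1}^N (x_{n+1} - x_n)y_n $$

Consequently,

$$ S^\tau_N = \sum_{n=N_0+1}^N \frac{1}{\beta_{n+1}}\big[\tilde\theta^\bullet_{n+1} - \tilde\theta^\bullet_n\big] = \frac{1}{\beta_{N+1}}\tilde\theta^\bullet_{N+1} - \frac{1}{\beta_{N_0+1}}\tilde\theta^\bullet_{N_0+1} - \sum_{n=N_0+1}^N \Big(\frac{1}{\beta_{n+1}} - \frac{1}{\beta_n}\Big)\tilde\theta^\bullet_n \tag{8.77} $$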

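The assumptions imposed below imply strong bounds on each term.

Assumptions for PJR Averaging: For fixed constants $b_\Delta$, $b_T$, $b_Z$, and every $n \ge 1$,

(PR1) The step-size sequence is $\beta_n = n^{-\rho}$ with $\tfrac12 < \rho < 1$.

(PR2) $\mathsf{E}[\Delta^*_{n+1}{\Delta^*_{n+1}}^T] = \Sigma_\Delta + O(\varrho^n)$ for some $\varrho \in (0,1)$,
$\mathsf{E}[\Delta_{n+1} \mid \mathcal{F}_n] = \mathsf{E}[\Delta^*_{n+1} \mid \mathcal{F}_n] = 0$, and
$\mathsf{E}[\|\Delta_{n+1} - \Delta^*_{n+1}\|^2 \mid \mathcal{F}_n] \le b_\Delta\|\tilde\theta^\bullet_n\|^2$.

(PR3) $\|\mathcal{E}_T(\theta^\bullet_n)\| \le b_T\|\tilde\theta^\bullet_n\|^2$.

(PR4) $\mathsf{E}[\|Z_n\|^4] \le b_Z$, with $Z_n = n^{\rho/2}\tilde\theta^\bullet_n$.

Once again, the martingale difference assumption in (PR2) is stronger than required, but allows for
relatively simple calculations. The fourth moment for $Z$ is not difficult to establish under a slight
strengthening of (PR2):

$$ \mathsf{E}[\|\Delta_{n+1} - \Delta^*_{n+1}\|^4 \mid \mathcal{F}_n] \le b'_\Delta\|\tilde\theta^\bullet_n\|^4 $$

for a constant $b'_\Delta$, along with either (i) or (ii) of Thm. 8.9.

The optimal convergence rate for PJR is now easily established. For simplicity we take $N_0 = 0$
in Thm. 8.13.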
Theorem 8.13. Under (PR1)-(PR4), consider the PJR scheme obtained for arbitrary $N \ge 1$
and $N_0 = 0$. Then (8.75) holds in the form

$$ \Sigma^{PR}_N = \frac{1}{N}\Sigma^*_\theta + \Sigma^E_N $$

where $\Sigma^*_\theta = A^{-1}\Sigma_\Delta (A^{-1})^T$, $\|\Sigma^E_N\| \le O(N^{-1-\delta})$, and with $\delta = \min(\tfrac12(1-\rho),\ \rho - \tfrac12,\ \rho/2) > 0$. Consequently,
the asymptotic covariance for $\{\theta^{PR}_N\}$ is optimal.


We require a simple companion to summation by parts: 


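Lemma 8.14. The following bounds hold under (PR1):

$$ \sum_{n=1}^N \Big|\frac{1}{\beta_{n+1}} - \frac{1}{\beta_n}\Big|\sqrt{\beta_n} \le 2N^{\rho/2} + O(1), \qquad \sum_{n=1}^N \beta_n \le \frac{1}{1-\rho}N^{1-\rho} + O(1) $$

Proof. Each bound is obtained by comparing a sum to an integral. Details for the first bound
follow:

$$ \begin{aligned} \sum_{n=1}^N \Big|\frac{1}{\beta_{n+1}} - \frac{1}{\beta_n}\Big|\sqrt{\beta_n} &= \sum_{n=1}^N [(n+1)^\rho - n^\rho]\, n^{-\rho/2} \\ &= \rho\sum_{n=1}^N n^{\rho-1} n^{-\rho/2} + O(1) \\ &= \rho\int_{1}^N x^{-1+\rho/2}\,dx + O(1) \\ &= \rho\Big(\frac{2}{\rho}x^{\rho/2}\Big|_1^N\Big) + O(1) = 2N^{\rho/2} + O(1) \end{aligned} $$

The proof of the second bound is similar. □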
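The sum-to-integral comparison can be checked numerically. The following sketch is an illustration rather than part of the proof: `first_sum` evaluates the first sum appearing in the proof of Lemma 8.14, and `gap` subtracts the leading term $2N^{\rho/2}$. The value $\rho = 0.7$ is an arbitrary choice in $(\tfrac12, 1)$.

```python
def first_sum(rho, n_max):
    """Sum over n = 1..N of [(n+1)^rho - n^rho] * n^{-rho/2}."""
    return sum(((n + 1) ** rho - n ** rho) * n ** (-rho / 2) for n in range(1, n_max + 1))


def gap(rho, n_max):
    """Difference between the sum and the leading term 2 * N^{rho/2}."""
    return first_sum(rho, n_max) - 2.0 * n_max ** (rho / 2)


# If the O(1) claim holds, the gap should settle to a constant as N grows
gap_small = gap(0.7, 10_000)
gap_large = gap(0.7, 100_000)
```

The two gaps nearly coincide, which is what the $O(1)$ remainder predicts.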


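It is convenient to use vector space notation through the remainder of the proof: for any vector-valued
random variable $Z$,

$$ \|Z\|_2 = \Big(\sum_i \mathsf{E}[Z(i)^2]\Big)^{1/2} $$

To refine the bound (8.75) (with $N_0 = 0$) we begin with (8.72), written as

$$ \tilde\theta^{PR}_N = V_N - E_N, \qquad\text{where}\quad V_N = -\frac{1}{N}A^{-1}M_N \quad\text{and}\quad E_N = -\frac{1}{N}A^{-1}\big[S^\tau_N - S^T_N - S^\Delta_N\big] $$

Lemma 8.15. The covariance admits the approximation

$$ \Sigma^{PR}_N = \mathrm{Cov}(V_N) + \Sigma^E_N $$

where $\mathrm{Cov}(V_N) = \frac{1}{N}\Sigma^*_\theta + O(N^{-2})$
and $\|\Sigma^E_N\| \le \varepsilon_N$, with $\varepsilon_N = 2\|V_N\|_2\|E_N\|_2 + \|E_N\|_2^2$.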
Proof. Note first that the approximation for the covariance of $V_N$ follows immediately from (PR2).
Next we obtain by the definitions,

$$ \Sigma^{PR}_N = \mathsf{E}[(V_N - E_N)(V_N - E_N)^T] = \mathrm{Cov}(V_N) + \Sigma^E_N $$

with

$$ \Sigma^E_N = \mathsf{E}[-E_N V_N^T - V_N E_N^T + E_N E_N^T] $$

The remainder of the proof is concerned with a bound on the maximal eigenvalue of this error
term:

$$ \max_{\|v\| = 1} v^T\Sigma^E_N v $$

For any $v \in \mathbb{R}^d$ we have

$$ v^T\Sigma^E_N v = \mathsf{E}[-2(v^TE_N)(v^TV_N) + (v^TE_N)^2] $$

By the Cauchy-Schwarz inequality for vectors we have $|v^TE_N| \le \|E_N\|$ and $|v^TV_N| \le \|V_N\|$ when
$\|v\| = 1$, and then by the Cauchy-Schwarz inequality for expectation of products of random variables:

$$ v^T\Sigma^E_N v \le 2\|V_N\|_2\|E_N\|_2 + \|E_N\|_2^2 = \varepsilon_N $$ □
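Proof of Thm. 8.13. We have from the definitions and the triangle inequality,

$$ \varepsilon_N \le 2\|V_N\|_2\frac{1}{N}\|A^{-1}\|\big(\|S^\tau_N\|_2 + \|S^T_N\|_2 + \|S^\Delta_N\|_2\big) + \frac{1}{N^2}\|A^{-1}\|^2\big(\|S^\tau_N\|_2 + \|S^T_N\|_2 + \|S^\Delta_N\|_2\big)^2 $$

and also $\|V_N\|_2 \le O(N^{-1/2})$, giving

$$ \varepsilon_N \le O(N^{-3/2})\big(\|S^\tau_N\|_2 + \|S^T_N\|_2 + \|S^\Delta_N\|_2\big) + O(N^{-2})\big(\|S^\tau_N\|_2 + \|S^T_N\|_2 + \|S^\Delta_N\|_2\big)^2 $$

To complete the proof we now show that

$$ N^{-3/2}\big(\|S^\tau_N\|_2 + \|S^T_N\|_2 + \|S^\Delta_N\|_2\big) \le O(N^{-1-\delta}) $$

where $\delta$ is defined in the theorem. It will then follow that $\varepsilon_N \le O(N^{-1-\delta})$.

▷ $N^{-3/2}\|S^\tau_N\|_2 \le O(N^{-1-\frac12(1-\rho)}) \le O(N^{-1-\delta})$: Under (PR4) we have by the definition of $Z_n$ and
Jensen’s inequality

$$ \mathsf{E}[\|\tilde\theta^\bullet_n\|^4] \le b_Z\beta_n^2, \qquad \mathsf{E}[\|\tilde\theta^\bullet_n\|^2] \le \sqrt{b_Z}\,\beta_n \tag{8.78} $$

The triangle inequality applied to the representation (8.77) gives

$$ \begin{aligned} \|S^\tau_N\|_2 &\le \frac{1}{\beta_{N+1}}\|\tilde\theta^\bullet_{N+1}\|_2 + \frac{1}{\beta_1}\|\tilde\theta^\bullet_1\|_2 + \sum_{n=1}^N \Big|\frac{1}{\beta_{n+1}} - \frac{1}{\beta_n}\Big|\,\|\tilde\theta^\bullet_n\|_2 \\ &\le \frac{b_Z^{1/4}}{\sqrt{\beta_{N+1}}} + \frac{\|\tilde\theta^\bullet_1\|_2}{\beta_1} + b_Z^{1/4}\sum_{n=1}^N \Big|\frac{1}{\beta_{n+1}} - \frac{1}{\beta_n}\Big|\sqrt{\beta_n} \end{aligned} $$

The right hand side is bounded by a constant times $N^{\rho/2}$ by Lemma 8.14.

▷ $N^{-3/2}\|S^T_N\|_2 \le O(N^{-1-(\rho-\frac12)}) \le O(N^{-1-\delta})$: Applying (8.78) and Lemma 8.14,

$$ \|S^T_N\|_2 \le \sum_{n=1}^N \|\mathcal{E}_T(\theta^\bullet_n)\|_2 \le b_T\sum_{n=1}^N \big\|\,\|\tilde\theta^\bullet_n\|^2\big\|_2 \le b_T\sqrt{b_Z}\sum_{n=1}^N \beta_n \le O(N^{1-\rho}) $$

▷ $N^{-3/2}\|S^\Delta_N\|_2 \le O(N^{-1-\rho/2}) \le O(N^{-1-\delta})$: The sequence $\{\tilde\Delta_{n+1} = \Delta_{n+1} - \Delta^*_{n+1}\}$ is uncorrelated,
so that

$$ \|S^\Delta_N\|_2^2 = \Big\|\sum_{n=1}^N \tilde\Delta_{n+1}\Big\|_2^2 = \sum_{n=1}^N \|\tilde\Delta_{n+1}\|_2^2 $$

Applying (PR2) gives $\|\tilde\Delta_{n+1}\|_2^2 \le b_\Delta\mathsf{E}[\|\tilde\theta^\bullet_n\|^2]$, and then from (8.78) we obtain $\|\tilde\Delta_{n+1}\|_2^2 \le b_\Delta\sqrt{b_Z}\,\beta_n$.
Putting these together with Lemma 8.14,

$$ \|S^\Delta_N\|_2^2 \le \sqrt{b_Z}\,b_\Delta\sum_{n=1}^N \beta_n \le O(N^{1-\rho}) $$ □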
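For a scalar linear recursion the covariance of the averaged estimate can be computed exactly, with no simulation, which gives a concrete check on the conclusion of Thm. 8.13. The sketch below is an illustration under stated assumptions: the recursion is the scalar version of (8.67) with step-size $\beta_n = n^{-\rho}$, the function name `pjr_scaled_variance` and the values $a = -1$, $\rho = 0.7$, $\sigma^2_\Delta = 1$ are hypothetical, and $N_0 = 0$ so that the average is over $n = 1, \dots, N$.

```python
def pjr_scaled_variance(a=-1.0, rho=0.7, n_total=100_000, sigma2=1.0):
    """Return N * E[(theta^PR_N - theta*)^2] for a scalar linear SA recursion.

    err_{n+1} = (1 + a * beta_{n+1}) * err_n + beta_{n+1} * Delta_{n+1},
    S_n = err_1 + ... + err_n, and theta^PR_N - theta* = S_N / N.
    Second moments are propagated exactly, using E[Delta_{n+1} | F_n] = 0.
    """
    v = 0.0   # E[err_n^2], starting from err_0 = 0
    c = 0.0   # E[err_n * S_n]
    s = 0.0   # E[S_n^2]
    for n in range(1, n_total + 1):
        beta = n ** (-rho)
        g = 1.0 + a * beta
        v_next = g * g * v + beta * beta * sigma2
        s = s + 2.0 * g * c + v_next      # E[(S_n + err_{n+1})^2]
        c = g * c + v_next                # E[err_{n+1} * (S_n + err_{n+1})]
        v = v_next
    return s / n_total


scaled = pjr_scaled_variance()
optimal = 1.0 / (-1.0) ** 2   # scalar version of A^{-1} Sigma_Delta A^{-T} with a = -1
```

The scaled variance is close to the optimal value, while the un-averaged estimate has variance of order $\beta_N$, which is far larger than $1/N$.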


8.7 Exercises 


8.1 Consider the linear SA recursion (8.59), this time with Gaussian multiplicative disturbance

$$ \theta_{n+1} = \theta_n + \alpha_{n+1}\big[(-1 + \Xi_{n+1})\theta_n + \Delta_{n+1}\big] $$
$$ \Xi_{n+1} = \sqrt{1-\delta}\,\Xi_n + \sqrt{\delta}\,\Delta_{n+1} $$

with $\delta \in (0,1)$ and $\Delta$ i.i.d. $N(0,1)$. In some sense this multiplicative noise has far less memory
than the example of Section 8.5.2. The sense in which this statement is true is made precise in [59],
where the example (8.59) was first introduced.

(a) Verify that $\Xi$ is a Markov chain whose marginals are Gaussian for each deterministic $\Xi_0$, and
ergodic in the sense that for any $r \in \mathbb{R}$,

$$ \lim_{n\to\infty} \mathsf{P}\{\Xi_n \ge r\} = \mathsf{P}\{\Delta_1 \ge r\} $$

Consequently, the ODE approximation for this SA recursion is again linear, with $\bar f(\theta) = -\theta$.

(b) Repeat the experiments in Section 8.5.2 to investigate whether or not the CLT holds for this 
example, by obtaining a histogram of {Z’,:1<i< M} for M = 10° and n= 10™ for m =3,4,5 
(recall (8.9) for the definition of $Z_n$). Test several values of $\delta$, and in each case obtain histograms
after removing outliers (say, estimates satisfying $|Z_n| > 5$).


8.2 Avoidance of traps. We wish to minimize a function $\Gamma(x)$ over $x \in \mathbb{R}$. We have
access only to noisy measurements of its gradient:

$$ Y(k) = \nabla\Gamma(x) + N(k) $$

where $N$ is i.i.d., with zero mean, and unit variance. In this exercise you will try out the SA
algorithm using $\Gamma(x) = x^2(1 + (x + 10)^2)$.

(a) Apply the stochastic approximation algorithm repeatedly, from various initial conditions, to
obtain estimates $\{X(k)\}$ of $x^* = 0$. Obtain an estimate of the probability that $X(\infty) = 0$ when
$X(0) = 20$. Repeat, with $X(0) = -20$.

(b) Compare the sample path behavior of the standard SA algorithm, with the algorithm obtained 
using Polyak’s averaging technique (again for X(0) = 20, and X(0) = —20). You should present 
histograms for multiple runs, reviewing the advice and warnings in Section 6.7.1. 


(c) Propose a modification of the SA algorithm to ensure that your estimates converge to $x^* = 0$
with probability one, from each initial condition.


8.8 Notes


This chapter is distilled from many years of collaborations and discussions with Vivek Borkar, Ken 
Duffy, Ioannis Kontoyiannis and Eric Moulines, as well as recent collaborations with students and 
colleagues. Some of the material in this chapter is adapted from the first half of the book chapter 
[110], co-authored by a recent graduate student Adithya Devraj and colleague Ana Busic. 

For a full history of stochastic approximation it is best to go to the classic texts [39, 204, 65, 67]. 
In brief: a scalar version of the stochastic approximation algorithm was introduced by Robbins and 
Monro in [301]. Blum in [58] extended the theory to vector valued recursions. Convergence theory 
for SA appeared to be mature by the end of the 1990s, marked by the ambitious work of Benaim 
on the dynamics of stochastic approximation algorithms [37], which led to my favorite book on the 
subject [65]. As made clear in the second edition [67], the theory evolves and becomes stronger 
each year. 

What follows is a bit more history on topics most relevant to RL. 


SA and RL One driving force behind the recent evolution of RL has been the need for new theory 
to support complex RL algorithms, and one person who has formed the strongest bridge between 
SA and RL is Bhatnagar [296, 297, 180, 379]. The fact that many RL algorithms can be cast as 
instances of SA was first observed in [352, 169]. Over the decade that followed, SA theory was a 
primary tool of the MIT RL school [192, 193, 356, 354, 353], that had tremendous impact on my 
own research. With increasing interest in RL there has been an impressive wave in contributions 
to stochastic recursive algorithms, with particular attention on identification of sharp bounds on 
the rate of convergence. 


Stability The stability theory contained in Section 8.6.1 is taken from [70], which is expanded
upon in Borkar’s monograph [65, 67]. The recent work [59] provides conditions justifying Assumption
(PR4) for nonlinear SA recursions: it is shown that $\sup_n \mathsf{E}[\|Z_n\|^4] < \infty$ subject to two significant
assumptions: the ODE@∞ is stable in the sense of Lyapunov, and the underlying Markov chain
satisfies a condition slightly stronger than geometric ergodicity.

Stability conditions for two time-scale SA can be found in [209], justifying the assumptions of 
Thm. 8.3. 

Stability theory for Zap SA is a simple corollary to the general stability theory for two time-scale 
SA, provided the assumptions of the theory hold. Fortunately, the stability of the Newton-Raphson 
flow is almost universal, even when the function f is not everywhere smooth, and this leads to 
convergence of Zap SA under conditions more general than predicted by the general theory [90]. 

The two time-scale algorithm (8.52) that defines first order Zap SA is inspired by a similar 
architecture used in GQ learning [235]. An alternative approach to approximate the Newton Raph- 
son flow without matrix inversion was introduced in [108], based on ideas similar to the second 
order techniques of Polyak and Nesterov (essentially in (4.155) in which {6;,} is a carefully designed 
matrix sequence). 


Asymptotic statistics Asymptotic statistical theory for SA is extremely rich. Large Deviations 
or Central Limit Theorem (CLT) limits hold under very general assumptions for both SA and 
related Monte-Carlo techniques [39, 204, 188, 65, 67, 257]. The variance analysis in Section 8.6.2 
is adapted from the surveys [112, 109, 110], which themselves are based on standard material
[65, 67, 193, 204, 39]. 


Pre-publication draft -- March 25, 2022 




The optimal asymptotic variance, and techniques to obtain the optimum for scalar recursions,
were introduced by Chung [95] soon after the introduction of SA (see also [309, 125]). Chung’s
algorithm can be cast as a form of stochastic Newton-Raphson (described in Section 8.4.3). 

Gradient-free methods known as stochastic quasi Newton-Raphson (SqNR) appeared in later 
work: The first such algorithm was proposed by Venter in [366], which was shown to obtain the 
optimal variance for a one-dimensional SA recursion. The algorithm obtains estimates of the 
SNR gain $-A^{-1}$ through a procedure similar to the Kiefer-Wolfowitz algorithm [182]. Ruppert
introduced an extension of Venter’s algorithm for vector-valued functions in [305]. 

The averaging technique came decades later in the independent work of Ruppert [306] and 
Polyak and Juditsky [287, 288] ([193] provides an accessible treatment in a simplified setting). It is 
noted in [266] that the averaging approach often leads to very large transients, so that the algorithm 
should be modified (such as through projection of parameter updates). The brief introduction in 
Section 8.6.3 is inspired by the elegant summary of [288] contained in [266]. 

The more recent Zap SA algorithm can be regarded as a significant extension of Chung’s original 
idea. It is often far more practical, since stability is essentially universal [109, 112, 113] (see also 
the dissertation [107]). 


Less asymptotic statistics The articles [272, 266, 20] had significant impact because they es- 
tablished similar finite-n bounds for stochastic gradient descent, and also obtained the optimal 
covariance by use of PJR averaging. These articles also compare SA with ERM, arriving at conclu- 
sions similar to those of Section 8.2.3: in many cases ERM leads to significant complexity without 
clear benefit. However, this is only true if the SA algorithm is designed with care (meaning, 
attention to all of the potential traps surveyed in Section 8.1). 

The article [20] suggests variants of PJR averaging using a constant step-size in (8.10a). In 
experiments this often works very well. Unfortunately, theory is currently lacking outside of the 
special linear SA model (8.66) with martingale difference disturbance [265]. The example in Sec- 
tion 8.5.2 and the example in Exercise 4.18 each suggest that analysis will be more challenging for 
linear SA when {A,} is a random matrix sequence, rather than fixed as in (8.66). 

The recent paper [351] presents new approaches to finite-n bounds based on a combination of 
perturbation theory for ODEs due to Alekseev [4] and approximation theory for martingales. 

The literature on finite-n error bounds for SA recursions with Markovian noise has been recent. 
Bounds are obtained in [49] for both vanishing step-sizes, and for (carefully selected) constant step- 
size TD-learning, with projection. Finite time bounds for SA with constant-step size are obtained 
in [333] by considering the drift of an appropriately chosen Lyapunov function. In the recent work 
[89] (briefly surveyed in Section 8.6.2), the limit (8.6) is refined to obtain something approaching a 
finite time error bound, which provides further motivation for optimizing the asymptotic covariance 
$\Sigma_\theta$. All of this theory is encouraging: these approximations justify the use of asymptotic bounds
for algorithm design. 

The asymptotic covariance also lies beneath the surface in the theory of finite-time error bounds
such as (8.31). Here is what can be expected from the theory of large deviations [105, 196]: on
denoting the rate function by

$$ I_i(\varepsilon) \overset{\text{def}}{=} -\lim_{n\to\infty}\frac{1}{n}\log\mathsf{P}\{|\theta_n(i) - \theta^*(i)| > \varepsilon\} \tag{8.79} $$

the second order Taylor series approximation holds,

$$ I_i(\varepsilon) = \frac{1}{2\Sigma_\theta(i,i)}\varepsilon^2 + O(\varepsilon^3) \tag{8.80} $$

Hence a small asymptotic covariance is a pre-requisite for a large rate function, and hence also a
large exponent $\bar I(\varepsilon)$ in (8.31).
Chapter 9


Temporal Difference Methods 


This chapter is a gentle introduction to temporal difference methods in a stochastic environment. 
The main challenge here is that we cannot directly apply the ideas in Chapter 5, because the 
Bellman equations for MDPs involve a conditional expectation. 

The discounted-cost optimality equation (DCOE) is a focus in the second half of this chapter.
The state-input value function $Q^*$ admits a model free representation entirely analogous to (3.46):
for any admissible input, and each $k \ge 0$,

$$ 0 = \mathsf{E}\big[-Q^*(X(k),U(k)) + c(X(k),U(k)) + \gamma\underline{Q}^*(X(k+1)) \mid \mathcal{F}_k\big] \tag{9.1} $$

where $\underline{Q}^*(x) = \min_u Q^*(x,u)$, and $\mathcal{F}_k$ represents the “history up to time $k$”. Section 9.2 contains a
tutorial on conditional expectations that is intended to simultaneously demystify these abstractions,
and propose approximations of (9.1).


Asymptotic Statistics. Algorithm design and performance analysis is centered around the 
asymptotic covariance, without forgetting warnings regarding transients (recall the examples 
in Section 8.3.3). 


Fig. 1.3 illustrates the power of asymptotic statistics for estimating confidence bounds in 
RL. The plots demonstrate that the covariance can be estimated based on data collected over 
a short run: as small as $N = 10^4$ in this example. The data was obtained from a Q-learning
algorithm introduced in Section 9.7, and the theoretical density was computed based on the 
Lyapunov equation (8.28a). 


This chapter may be regarded as a “part 1” on temporal difference methods. The first half of 
this chapter focuses on the simpler problem of approximating the value function or Q-function for 
a fixed (possibly randomized) policy. Algorithms to approximate Q* are surveyed in Sections 9.6 
to 9.8. 

In addition to discounting, theory is restricted to finite state space and input space, but the 
notation should make it clear that these assumptions are not essential. Algorithm construction and 
analysis will be simplified further by focusing on linear function approximation: 


$$ H^\theta = \theta^T\psi, \qquad \psi : \mathsf{X}\times\mathsf{U} \to \mathbb{R}^d \tag{9.2} $$


Chapter 10 contains extensions and a wealth of theory that forms a foundation for actor-critic
methods.

The use of $H^\theta$ as an approximation of $Q^*$ or a fixed-policy Q-function is a departure from the
notational convention in Chapter 5, which used the more suggestive notation $Q^\theta$. The motivation
for the change will be clear in Chapter 10 when we consider a parameterized family of policies.

The following notational conventions will be followed in this chapter and the next: 


Ergodicity for a stationary policy $\phi$. The Markov chain $X$ has transition matrix $P_\phi$
defined in (7.12). The pair process $\Phi \overset{\text{def}}{=} \{\Phi(k) = (X(k),U(k)) : k \ge 0\}$ is also Markovian,
with state space $\mathsf{Z} = \mathsf{X}\times\mathsf{U}$, and transition matrix

$$ T_\phi(z,z') = P_u(x,x')\phi(u' \mid x'), \qquad \text{for any } z = (x,u) \text{ and } z' = (x',u'). \tag{9.3a} $$

It is assumed that the invariant pmf $\pi$ for $P_\phi$ is unique, and the invariant pmf for $T_\phi$ is then

$$ \varpi(x,u) = \pi(x)\phi(u \mid x), \qquad x\in\mathsf{X},\ u\in\mathsf{U} \tag{9.3b} $$

As in Chapter 5, in an attempt to streamline notation we denote

$$ \psi_{(n)} = \psi(\Phi(n)) \quad\text{and}\quad c_n = c(\Phi(n)), \qquad n \ge 0 \tag{9.4} $$

with $\Phi(n) = (X(n),U(n))$ as above.

Another notational convention regarding a function $H\colon \mathsf{X}\times\mathsf{U} \to \mathbb{R}$ requires explanation. In
eq. (3.7d) and eq. (9.1) the function $\underline{Q}^*$ is defined as a minimum. This notation will be modified,
depending on the context:


Underbar notation explained. For $H\colon \mathsf{X}\times\mathsf{U} \to \mathbb{R}$,

$$ \underline{H}(x) = H(x,\phi(x)) \qquad \text{Fixed policy setting, } \phi \text{ deterministic} \tag{9.5a} $$

$$ \underline{H}(x) = \sum_u H(x,u)\phi(u \mid x) \qquad \text{Fixed policy setting, } \phi \text{ randomized} \tag{9.5b} $$

$$ \underline{H}(x) = \min_u H(x,u) \qquad \text{Approximating a DP optimality equation} \tag{9.5c} $$


The notation (9.5a) is a simplification of (5.28b), which stressed the particular policy of interest.
The common notation is useful in this chapter and the next to stress the similarity of the various
algorithms. The meaning will be clear from context.


We begin with control background and motivation for estimating the fixed-policy value function. 


9.1 Policy Improvement 


Let’s begin with a quick recap of the fixed-policy value function. To simplify notation we restrict
to deterministic policies in this section, so that the convention (9.5a) is adopted to define $\underline{H}$.


9.1.1 Fixed-policy value functions and DP equations


Two value functions are of interest in this chapter:

$$ h(x) = \sum_{k=0}^\infty \gamma^k\mathsf{E}[c(\Phi(k)) \mid X(0) = x], \qquad U(k) = \phi(X(k)),\ k \ge 0 \tag{9.6a} $$

$$ Q(x,u) = \sum_{k=0}^\infty \gamma^k\mathsf{E}[c(\Phi(k)) \mid X(0) = x,\ U(0) = u], \qquad U(k) = \phi(X(k)),\ k \ge 1 \tag{9.6b} $$

The latter is known as the fixed-policy Q-function. Each satisfies a dynamic programming equation:

$$ h(x) = c_\phi(x) + \gamma\sum_{x'} P_\phi(x,x')h(x'), \qquad Q(x,u) = c(x,u) + \gamma\sum_{x'} P_u(x,x')\underline{Q}(x') \tag{9.7} $$

with $c_\phi(x) = \underline{c}(x) = c(x,\phi(x))$ and $\underline{Q}(x) = Q(x,\phi(x))$ for $x \in \mathsf{X}$ (recall (9.5a)). The DP equation
for $h$ appeared in (6.33).

The chapter starts off with methods to approximate $h$. In this case it is convenient to suppress
dependency of the transition matrix and the cost function on $\phi$. That is, we write $P$ instead of
$P_\phi$, and $c$ in place of $c_\phi$. Subject to these conventions, the DP equation for $h$ in (9.7) becomes
$h = c + \gamma Ph$, with the probabilistic implication

$$ 0 = \mathsf{E}\big[-h(X(k)) + c(X(k)) + \gamma h(X(k+1)) \mid X(0),\dots,X(k)\big], \qquad k \ge 0 \tag{9.8} $$

The fact that $h$ solves a linear fixed point equation makes the function approximation problem far
simpler (compared to the nonlinear fixed point equation (9.1)). The biggest challenge is, how do
we devise a learning algorithm that takes into account the conditional expectation appearing in
(9.8)? Approaches to answer this question are surveyed in Section 9.2.
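When a model is available, the linear fixed point equation $h = c + \gamma Ph$ can be solved by repeated substitution, since the right hand side is a contraction for $\gamma < 1$. The following is a minimal sketch: the three-state transition matrix `P`, cost `c`, and discount factor are hypothetical, chosen only for illustration.

```python
def evaluate_policy(P, c, gamma, tol=1e-12, max_iter=100_000):
    """Approximate the solution of h = c + gamma * P h by successive approximation."""
    n = len(c)
    h = [0.0] * n
    for _ in range(max_iter):
        h_new = [c[x] + gamma * sum(P[x][y] * h[y] for y in range(n)) for x in range(n)]
        if max(abs(h_new[x] - h[x]) for x in range(n)) < tol:
            return h_new
        h = h_new
    return h


# Hypothetical three-state chain under a fixed policy
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]
c = [0.0, 1.0, 4.0]
gamma = 0.9

h = evaluate_policy(P, c, gamma)
# Residual of the DP equation at the computed solution
residual = max(abs(h[x] - c[x] - gamma * sum(P[x][y] * h[y] for y in range(3)))
               for x in range(3))
```

TD-learning is aimed at the same fixed point, but without access to $P$: the conditional expectation in (9.8) must then be handled through observed samples.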


We first answer the question: what do we do with a value function? The most common an- 
swer is the policy improvement step in the Policy Improvement Algorithm (PIA) (also called the 
Policy Iteration Algorithm). The definition isn’t very different from the deterministic counterpart 
introduced in Section 3.2.2. 


9.1.2 PIA and the Q-function 


For the discounted-cost criterion, the algorithm is defined as follows: 


Policy Improvement Algorithm (PIA) for discounted cost 


Given an initial policy $\phi_0$, a sequence $(\phi_n, h_n)$ is defined as follows: At stage $n$, given $\phi_n$,
(i) Solve
$$ c_{\phi_n} + \gamma P_n h_n = h_n \tag{9.9a} $$
where $P_n = P_{\phi_n}$ is the transition matrix obtained when the chain is controlled using $\phi_n$.


(ii) Construct a new policy:
$$ \phi_{n+1}(x) = \arg\min_u\big(c(x,u) + \gamma P_u h_n(x)\big), \qquad x \in \mathsf{X} \tag{9.9b} $$

where $P_u h_n(x)$ is defined in (7.15).

This algorithm is consistent, with $h_n(x) \downarrow h^*(x)$ for each $x$, as $n \to \infty$ (the fact that $h_n(x)$ is
non-increasing is a corollary to Prop. 9.2 below).
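The two steps of the PIA can be sketched for a small finite MDP as follows. Everything specific here is an assumption made for illustration: the function names `evaluate` and `pia`, the data layout `P[u][x][y]` and `c[x][u]`, the three-state two-input model, and the use of repeated substitution in place of an exact linear solve in step (i).

```python
def evaluate(P_pol, c_pol, gamma, n_sweeps=2000):
    """Step (i): approximate the solution of c + gamma * P h = h by repeated substitution."""
    n = len(c_pol)
    h = [0.0] * n
    for _ in range(n_sweeps):
        h = [c_pol[x] + gamma * sum(P_pol[x][y] * h[y] for y in range(n)) for x in range(n)]
    return h


def pia(P, c, gamma, n_stages=10):
    """P[u][x][y] is the transition probability under input u; c[x][u] is the cost."""
    n_x, n_u = len(c), len(c[0])
    phi = [0] * n_x                       # initial policy: always apply u = 0
    history = []
    for _ in range(n_stages):
        P_n = [P[phi[x]][x] for x in range(n_x)]
        c_n = [c[x][phi[x]] for x in range(n_x)]
        h = evaluate(P_n, c_n, gamma)
        history.append(h)
        # Step (ii): greedy policy with respect to c(x,u) + gamma * P_u h (x)
        phi = [min(range(n_u),
                   key=lambda u: c[x][u] + gamma * sum(P[u][x][y] * h[y] for y in range(n_x)))
               for x in range(n_x)]
    return phi, history


# Hypothetical model: u = 0 lets the state drift upward, u = 1 pushes it down at extra cost
P = [
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.8, 0.2]],
]
c = [[0.0, 0.5], [1.0, 1.5], [2.0, 2.5]]
phi_final, history = pia(P, c, gamma=0.9)
```

The list `history` records $h_n$ at each stage, so the monotonicity $h_{n+1}(x) \le h_n(x)$ can be inspected directly.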

TD-learning algorithms are designed to approximate the solution to (9.9a). However, we cannot
obtain the updated policy (9.9b) unless we have the model $P_u$, and are also prepared to solve the
minimization over $u$ for each $x$. A potential solution to both challenges is to approximate the
function of two variables within the brackets in (9.9b), the fixed policy Q-function associated with
the policy $\phi_n$:

$$ Q_n(x,u) = c(x,u) + \gamma P_u h_n(x) $$

Any policy $\phi$ satisfying $\phi(x) \in \arg\min_u Q_n(x,u)$ for each $x$ is called “$Q_n$-greedy”.

The representation (9.6b) holds:

$$ Q_n(x,u) = \sum_{k=0}^\infty \gamma^k\mathsf{E}[c(\Phi(k)) \mid \Phi(0) = (x,u)] \qquad \text{with } U(k) = \phi_n(X(k)) \text{ for } k \ge 1, $$

along with a fixed-point equation that is very similar to (9.9a):

$$ 0 = \mathsf{E}\big[-Q_n(\Phi(n)) + c(\Phi(n)) + \gamma Q_n(\Phi(n+1)) \mid \mathcal{F}_n\big], \qquad \text{with } U(k) = \phi_n(X(k)) \text{ for each } k, $$

and with $\mathcal{F}_n$ shorthand for $\{\Phi(0),\dots,\Phi(n)\}$. Any TD-learning algorithm can be applied to approximate
this Q-function, which is made possible by recognizing that the pair process $\Phi$ is a
time-homogeneous Markov chain, whenever $U$ is defined using a stationary Markov policy.

A TD-learning algorithm designed to estimate a fixed policy Q-function is commonly known as
SARSA. However, as explained in Section 5.3, we will opt for the term TD-learning regardless of
whether we are estimating $h$ or $Q$.


9.1.3 Advantage function 


If we have computed exactly the Q-function for a policy $\phi$, then the policy improvement step is
given by $\phi^+(x) = \arg\min_u Q(x,u)$. Rather than estimate $Q$, we could estimate $Q - G$ for any
function $G$ that does not depend upon $u$. The Q-greedy policy is unchanged:

$$ \phi^+(x) \in \arg\min_u\{Q(x,u) - G(x)\} = \arg\min_u Q(x,u) \tag{9.10} $$

A reasonable way to choose $G$ is by minimizing the mean-square error:

$$ G^* = \arg\min_G \|Q - G\|^2_\varpi = \arg\min_G \mathsf{E}_\varpi\big[\{Q(\Phi(n)) - G(X(n))\}^2\big] $$

where the subscript indicates that the expectation is in steady-state.
The proofs of Props. 9.1 and 9.2 are postponed to Section 9.9.1.


Proposition 9.1. The optimizer is the value function: $G^*(x) = \mathsf{E}_\varpi[Q(\Phi(n)) \mid X(n) = x] = h(x)$.
The difference $V = Q - h$ is known as the advantage function. The probabilistic implication
holds:

$$ 0 = \mathsf{E}\big[-V(\Phi(k)) - h(X(k)) + c(\Phi(k)) + \gamma h(X(k+1)) \mid \Phi(0),\dots,\Phi(k)\big] \tag{9.11} $$
Any of the TD algorithms can be adapted to approximate a solution, based on a joint parameterization
$\{V^\theta, h^\theta : \theta \in \mathbb{R}^d\}$. Details can be found in Section 9.5.4 and Section 10.2, as well as
algorithms to approximate the advantage function without a joint parameterization.

The advantage function appears in actor-critic methods as a variance reduction technique, and
also because of the following proposition which provides a means to compare policies.


Proposition 9.2. For any two policies $\phi$ and $\bar\phi$, let $h_\phi$ and $h_{\bar\phi}$ denote the associated value
functions on $\mathsf{X}$, and $V_\phi$ the advantage function for policy $\phi$. Then,

$$ h_{\bar\phi}(x) = h_\phi(x) + \mathsf{E}^{\bar\phi}\Big[\sum_{k=0}^\infty \gamma^k V_\phi(\Phi(k)) \,\Big|\, X(0) = x\Big], \qquad x \in \mathsf{X} $$

where the superscript on the expectation indicates that $\Phi(k)$ is the Markov chain obtained with
policy $\bar\phi$. □


Consider for example the choice $\bar\phi(x) = \arg\min_u V_\phi(x,u)$, which represents the policy improvement
step. We then have

$$ V_\phi(x,\bar\phi(x)) = \min_u V_\phi(x,u) = -h_\phi(x) + \min_u Q_\phi(x,u) \le -h_\phi(x) + Q_\phi(x,\phi(x)) = 0, \qquad x \in \mathsf{X}. $$

It follows that $V_\phi(\Phi(k)) = V_\phi(X(k),\bar\phi(X(k))) \le 0$ under the policy $\bar\phi$, for any $k \ge 0$. Together
with Prop. 9.2, this justifies the term policy improvement: $h_{\bar\phi}(x) \le h_\phi(x)$ for each $x$.


Example 9.1.1. Advantage for the M/M/1 queue 


The value of the advantage function is most obvious in this example. To obtain an MDP model,
we adopt a special case of the CRW model introduced in Section 7.3:

$$ \mathsf{P}\{X(k+1) = y \mid X(k) = x,\ U(k) = u\} = P_u(x,y) = \begin{cases} \alpha & \text{if } y = x+1 \\ \mu & \text{if } y = (x-u)_+ \end{cases} \tag{9.12} $$

with $\mathsf{U} = \{0,1\}$. For any cost function that is non-decreasing on $\mathsf{X} = \mathbb{Z}_+$, the optimal policy is
non-idling $\phi^*(x) = \mathbb{1}\{x \ge 1\}$.

Consider the cost function $c(x) = x$ with the average cost criterion. An expression for $V^* =
Q^* - h^*$ follows from the pair of identities

$$ h^*(x) = x - \eta + \mathsf{E}[h^*(X(k+1)) \mid X(k) = x] \qquad \text{(Poisson’s equation)} $$
$$ Q^*(x,u) = x - \eta + \mathsf{E}[h^*(X(k+1)) \mid \Phi(k) = (x,u)] \qquad \text{(definition)} $$

The relative value function $h^*$ appeared in (7.31), from which we obtain

$$ h^*(x) = \frac12\frac{x^2 + x}{\mu - \alpha} $$

$$ \begin{aligned} V^*(x,u) &= \mathsf{E}[h^*(X(k+1)) \mid \Phi(k) = (x,u)] - \mathsf{E}[h^*(X(k+1)) \mid X(k) = x] \\ &= \alpha h^*(x+1) + \mu\{u h^*(x-1) + (1-u)h^*(x)\} - [h^*(x) - x + \eta] \end{aligned} \tag{9.13} $$

where $\eta = \pi(c) = \alpha/(\mu - \alpha)$ is the steady-state mean under $\phi^*$.

A small amount of algebra gives

$$ V^*(x,u) = Q^*(x,u) - h^*(x) = (1-u)\,x\,\frac{\mu}{\mu - \alpha} $$

Crucial conclusion: The growth rate of $Q^*$ is quadratic in $x$, while $V^*$ grows linearly. This is a
tremendous benefit for variance reduction when applying TD learning or actor-critic algorithms
(see Section 10.2 for algorithms to estimate $V^*$).
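The algebra in this example can be confirmed with exact rational arithmetic. The sketch below rests on assumptions that should be read with care: the relative value function is taken to be $h^*(x) = \tfrac12 (x^2+x)/(\mu-\alpha)$, the Q-function is normalized as $Q^*(x,u) = x - \eta + \mathsf{E}[h^*(X(k+1)) \mid \Phi(k) = (x,u)]$ so that $\min_u Q^*(x,u) = h^*(x)$, and the value $\alpha = 2/5$ is an arbitrary choice with $\alpha + \mu = 1$ and $\alpha < \mu$.

```python
from fractions import Fraction

alpha = Fraction(2, 5)        # arrival probability (hypothetical value)
mu = 1 - alpha                # service probability, so that alpha + mu = 1
eta = alpha / (mu - alpha)    # steady-state mean cost under the non-idling policy


def h_star(x):
    """Assumed relative value function for c(x) = x."""
    return Fraction(x * x + x, 2) / (mu - alpha)


def advantage(x, u):
    """V*(x,u) = Q*(x,u) - h*(x), with Q*(x,u) = x - eta + E[h*(X(k+1)) | x, u]."""
    served = max(x - u, 0)    # next state (x - u)_+ when a service is completed
    q = x - eta + alpha * h_star(x + 1) + mu * h_star(served)
    return q - h_star(x)


def closed_form(x, u):
    """The claimed expression (1 - u) * x * mu / (mu - alpha)."""
    return (1 - u) * x * mu / (mu - alpha)
```

With this normalization the advantage vanishes under the non-idling input $u = 1$, and grows linearly in $x$ when the server idles.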

Obviously, nobody cares about optimizing the M/M/1 queue! Fortunately, similar structure
for value functions holds for Markovian queueing networks when $c$ is a linear function of the state
[252, 254].


9.2 Function Approximation and Smoothing 


We first attempt to demystify the conditional expectations appearing in the previous section. It is 
best to start with an abstraction: 7 
Z=EZ|Y] (9.14) 


where Z and Z are each scalar-valued random variables, and Y is a vector-valued random variable. 
For example, Y = (X(0),...,X(k)) in (9.8). When convenient to emphasize dependence on k we 
adopt the notation (7.14): 

E[Z | Fe] = E[Z | X(0),...,X(k)] 


Conditional expectation and projection  Subject to the assumption that $E[Z^2] < \infty$, the random variable $\widehat{Z}$ is the solution to a function approximation problem of the form surveyed in Section 5.1: $\widehat{Z} = \phi^*(Y)$, where

$$ \phi^* = \arg\min_{\phi \in \mathcal{H}} E[(Z - \phi(Y))^2] \tag{9.15} $$

and $\mathcal{H} = \{\phi : E[\phi(Y)^2] < \infty\}$. So, keep in mind throughout the remainder of the book:

Conditional Expectation  The conditional expectation $\widehat{Z}$, given the “data” $Y$, is the minimum mean-square error estimate of the random variable $Z$ based on this data.


The solution to the optimization problem (9.15) can be characterized geometrically; see [154] for a proof of Prop. 9.3, along with much more theory and intuition on this topic.

Proposition 9.3. A function $\phi^\circ \in \mathcal{H}$ solves the minimum in (9.15) if and only if the orthogonality property holds: for each $g \in \mathcal{H}$,

$$ 0 = E[(Z - \phi^\circ(Y))\, g(Y)] $$
□


A valuable corollary to Prop. 9.3 is the smoothing property of conditional expectation. If we have two random vectors $X$, $Y$, then we can take the conditional expectation based on the total data $(X,Y)$ to obtain a better estimate of $Z$ (we have increased the size of the function class in (9.15) to all functions of the pair $(x,y)$, so the optimal mean-square error cannot be worse). The smoothing property is then the consistency identity,

$$ E[Z \mid Y] = E\bigl[\, E[Z \mid X, Y] \bigm| Y \,\bigr] \tag{9.16} $$


Computing a conditional expectation is obviously a lofty goal, given the size of the function class 
H. How do we verify the characterization in Prop. 9.3, which requires evaluating an expectation for 
each g € H? The answer is, we cannot in most cases, so we resort to the approximation techniques 
introduced in Section 5.4.1. 


Approximating the $L_2$ Projection.  Given $d$ basis functions $\{\psi_i\}$, define a finite-dimensional function class by linear combinations $\mathcal{H} = \{\sum_i \theta_i \psi_i : \theta \in \mathbb{R}^d\}$. The estimate $\phi^*(Y)$ of $Z$, with $\phi^* \in \mathcal{H}$, is defined by either the Galerkin relaxation, or a restricted projection:

▷ Galerkin relaxation: Define a second collection of functions $\mathcal{G} = \{\sum_i \theta_i \zeta_i : \theta \in \mathbb{R}^d\}$, and select $\phi^* \in \mathcal{H}$ that solves

$$ 0 = E[(Z - \phi^*(Y))\, g(Y)], \qquad g \in \mathcal{G} \tag{9.17} $$

▷ Projection onto $\mathcal{H}$: Solve the projection problem

$$ \phi^* \in \arg\min\{ E[(Z - \phi(Y))^2] : \phi \in \mathcal{H} \} \tag{9.18} $$

We then write

$$ \widehat{E}[Z \mid Y] = \phi^*(Y) \tag{9.19} $$

Prop. 9.4 implies that the solution to (9.18) is identical to the Galerkin relaxation using $\mathcal{G} = \mathcal{H}$. Its proof is similar to (and simpler than) the derivation of (5.18):


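Proposition 9.4. A function $\phi^* = \sum_i \theta_i^* \psi_i$ solves the minimum in (9.18) if and only if

$$ 0 = E[(Z - \phi^*(Y))\, g(Y)], \qquad g \in \mathcal{H} \tag{9.20} $$

Any solution satisfies $\phi^* = \psi^\top \theta^*$ with

$$ R^\psi \theta^* = \bar\psi^Z \tag{9.21} $$

where $R^\psi$ is a $d \times d$ matrix and $\bar\psi^Z$ is a $d$-dimensional vector, with entries

$$ R^\psi_{i,j} = E[\psi_i(Y)\psi_j(Y)], \qquad \bar\psi^Z_i = E[Z\psi_i(Y)] \tag{9.22} $$

Consequently, if $R^\psi$ is full rank, then $\theta^* = [R^\psi]^{-1}\bar\psi^Z$ is the unique solution. □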
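As a numerical illustration of Prop. 9.4, the following sketch replaces the expectations in (9.22) by sample averages and solves the linear system (9.21). The data model ($Z$ equal to $Y^2$ plus noise), the polynomial basis, and the function name `project_onto_basis` are assumptions made only for this example.

```python
import numpy as np

def project_onto_basis(Y, Z, basis):
    """Solve the sample version of R^psi theta = psibar^Z, cf. (9.21)-(9.22)."""
    Psi = np.stack([psi(Y) for psi in basis], axis=1)   # N x d matrix of basis values
    R = Psi.T @ Psi / len(Z)                            # sample estimate of R^psi
    psibar = Psi.T @ Z / len(Z)                         # sample estimate of psibar^Z
    theta = np.linalg.solve(R, psibar)
    return theta, Psi

rng = np.random.default_rng(0)
Y = rng.uniform(-1.0, 1.0, size=20_000)
Z = Y**2 + 0.1 * rng.standard_normal(Y.size)            # hypothetical data model
basis = [lambda y: np.ones_like(y), lambda y: y, lambda y: y**2]

theta, Psi = project_onto_basis(Y, Z, basis)
residual = Z - Psi @ theta
orthogonality = Psi.T @ residual / len(Z)               # sample version of (9.20)
```

The vector `orthogonality` is zero up to round-off, which is the empirical counterpart of the orthogonality condition (9.20) tested against each basis function.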


Linear independence  In this chapter we apply Galerkin relaxation techniques based on $Y = \psi_{(n)} \overset{\text{def}}{=} \psi(\Phi(n))$ for arbitrary $n$, where $\Phi$ is the Markov chain with transition matrix (9.3a) (in Section 9.4 we briefly consider the restriction to $X$ rather than the pair process $\Phi$). For the purpose of analysis, it is assumed that $\Phi$ is stationary, which implies that the $d$-dimensional stochastic process $\{\psi_{(n)} : n \in \mathbb{Z}_+\}$ is also stationary. Its autocorrelation sequence is denoted

$$ R(j) = R(-j)^\top = E_\varpi[\psi_{(n+j)}\psi_{(n)}^\top], \qquad j \in \mathbb{Z}_+ \tag{9.23} $$

The matrix $R^\psi = R(0)$ is the steady-state version of the sample correlation matrix defined in (5.19), and a version of the matrix in (9.22).
The following definition appeared in Prop. 5.7:

The basis vectors are said to be linearly independent if the correlation matrix is full rank:

$$ R^\psi = E_\varpi[\psi_{(n)}\psi_{(n)}^\top] > 0 \tag{9.24} $$

An equivalent definition: $E_\varpi[\{\theta^\top\psi_{(n)}\}^2] > 0$ for any non-zero $\theta \in \mathbb{R}^d$.


Consider for example the tabular setting (5.10): $\mathsf{X}$ and $\mathsf{U}$ are finite, $\mathsf{X}\times\mathsf{U} = \{(x^i,u^i) : 1 \le i \le d\}$, and $\psi_i(x,u) = \mathbb{1}\{(x,u) = (x^i,u^i)\}$ for each $i$. When using a deterministic policy $U(k) = \phi(X(k))$ it follows that $\psi_i(X(k),U(k)) = 0$ for every $k$ if $u^i \ne \phi(x^i)$. The rank condition (9.24) is not satisfied in this case. For any randomized policy $\check\phi$, the matrix in (9.24) is diagonal, with

$$ R^\psi(i,i) = \varpi(x^i,u^i) = \pi(x^i)\,\check\phi(u^i \mid x^i), \qquad 1 \le i \le d \tag{9.25} $$

This matrix is full rank if and only if there is full exploration, in the sense that the right hand side is positive for each $i$.
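The rank condition in the tabular setting can be checked numerically. The sketch below builds the diagonal matrix in (9.25) for a deterministic policy and for a randomized perturbation of it. The pmf `pi`, the two policies, and the function name are hypothetical; for simplicity `pi` is held fixed, although in practice the steady-state pmf depends on the policy.

```python
import numpy as np

def tabular_correlation(pi, policy):
    """Diagonal matrix with entries pi(x) * policy[x, u], as in (9.25)."""
    return np.diag((pi[:, None] * policy).ravel())

pi = np.array([0.5, 0.3, 0.2])                      # hypothetical steady-state pmf on X
deterministic = np.array([[1.0, 0.0],               # one action per state
                          [0.0, 1.0],
                          [1.0, 0.0]])
randomized = 0.9 * deterministic + 0.1 * np.full((3, 2), 0.5)

rank_det = np.linalg.matrix_rank(tabular_correlation(pi, deterministic))
rank_rand = np.linalg.matrix_rank(tabular_correlation(pi, randomized))
```

With three states and two actions, the deterministic policy leaves half of the diagonal equal to zero, while the randomized policy gives a full rank matrix.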


9.3 Loss Functions 


In this section and the next we seek approximations of the discounted-cost value function $h$ defined in (9.6a). This is formulated as an approximate solution to (9.8) for each $k$, among a class of functions $\{h^\theta : \theta \in \mathbb{R}^d\}$. The temporal difference is defined as the error without the conditional expectation:

$$ D^\theta_{n+1} = -h^\theta(X(n)) + c(X(n)) + \gamma h^\theta(X(n+1)) \tag{9.26} $$

If there is $\theta^* \in \mathbb{R}^d$ giving $E[D^{\theta^*}_{n+1} \mid \mathcal{F}_n] = 0$ for all $n$, and if every state in $\mathsf{X}$ is visited, then $h^{\theta^*}$ solves (9.8). As in previous chapters, in most cases we can only hope to approximate.
For linear function approximation we denote

$$ h^\theta = \theta^\top\psi \tag{9.27} $$

where $\psi$ is a $d$-dimensional function on $\mathsf{X}$ rather than $\mathsf{Z} = \mathsf{X}\times\mathsf{U}$. To avoid risk of confusion we abandon the simplified notation (9.4) in this case.


9.3.1 Mean-square Bellman error

For each $\theta$ there is an associated Bellman error,

$$ \mathcal{B}^\theta(x) = E[D^\theta_{n+1} \mid X(n) = x] = -h^\theta(x) + c(x) + \gamma P h^\theta(x), \qquad x \in \mathsf{X} \tag{9.28} $$

where the second equality is the definition of the transition matrix. The mean-square Bellman error (MSBE) is then defined by

$$ E_\pi[\{\mathcal{B}^\theta(X)\}^2] \tag{9.29} $$


Pre-publication draft -- March 25, 2022 




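To minimize this objective function we might apply first order methods to find a stationary point: a vector $\theta^\circ$ satisfying

$$ 0 = \tfrac{1}{2}\nabla_\theta E_\pi[\{\mathcal{B}^\theta(X)\}^2] = E_\pi[\{\mathcal{B}^\theta(X)\}\nabla_\theta\mathcal{B}^\theta(X)], \qquad \theta = \theta^\circ $$

On substituting the definitions we obtain representations that suggest algorithms.

Lemma 9.5. The following holds for each $\theta \in \mathbb{R}^d$:

$$ -\tfrac{1}{2}\nabla_\theta E_\pi[\{\mathcal{B}^\theta(X)\}^2] = E_\pi[D^\theta_{n+1}\zeta^\theta_n] $$

where

$$ \zeta^\theta_n = \nabla_\theta E[h^\theta(X(n)) - \gamma h^\theta(X(n+1)) \mid \mathcal{F}_n] \tag{9.30} $$

Proof. By the Markov property we have

$$ \mathcal{B}^\theta(X(n)) = E[D^\theta_{n+1} \mid X(n)] = E[D^\theta_{n+1} \mid \mathcal{F}_n] $$

The gradient of interest is thus,

$$ -\tfrac{1}{2}\nabla_\theta E_\pi[\{\mathcal{B}^\theta(X(n))\}^2] = E\bigl[E[D^\theta_{n+1} \mid \mathcal{F}_n]\,\zeta^\theta_n\bigr] $$

where $\zeta^\theta_n$ is given in the statement of the lemma:

$$ \zeta^\theta_n = -\nabla_\theta E[D^\theta_{n+1} \mid \mathcal{F}_n] = \nabla_\theta E[h^\theta(X(n)) - \gamma h^\theta(X(n+1)) \mid \mathcal{F}_n] $$

The smoothing property of conditional expectation completes the proof:

$$ E\bigl[E[D^\theta_{n+1} \mid \mathcal{F}_n]\,\zeta^\theta_n\bigr] = E[D^\theta_{n+1}\zeta^\theta_n] $$
□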


The lemma suggests that it is possible to approximate the gradient flow for the objective $\Gamma(\theta) = E_\pi[\{\mathcal{B}^\theta(X)\}^2]$ using stochastic approximation, such as

$$ \theta_{n+1} = \theta_n + \alpha_{n+1} D^\theta_{n+1}\zeta^\theta_n\big|_{\theta=\theta_n} $$

However, the conditional expectation in (9.30) presents a challenge. Prop. 9.4 suggests many approximations based on the approximate conditional expectation (9.19). An alternative is suggested in the following:


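Proposition 9.6.

$$ E_\pi[\{\mathcal{B}^\theta(X(n))\}^2] = E_\pi[\{D^\theta_{n+1}\}^2] - \sigma^2(\theta) \tag{9.31} $$

where $\sigma^2(\theta)$ denotes the conditional variance:

$$ \sigma^2(\theta) = E_\pi\bigl[\{D^\theta_{n+1} - E[D^\theta_{n+1} \mid \mathcal{F}_n]\}^2\bigr] $$
□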


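The objective function $\Gamma(\theta) = \tfrac{1}{2}E_\pi[\{D^\theta_{n+1}\}^2]$ is thus equal to one half of the MSBE plus $\tfrac{1}{2}\sigma^2(\theta)$, which is relatively small in many applications. An SA algorithm designed to approximate the gradient flow is straightforward:

$$ \theta_{n+1} = \theta_n + \alpha_{n+1} D_{n+1}\zeta_{n+1} \tag{9.32} $$

with $D_{n+1} \overset{\text{def}}{=} D^{\theta_n}_{n+1}$ and

$$ \zeta_{n+1} = -\nabla_\theta D^\theta_{n+1}\big|_{\theta=\theta_n} = \nabla_\theta[h^\theta(X(n)) - \gamma h^\theta(X(n+1))]\big|_{\theta=\theta_n} $$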


One can use any of the techniques in Chapter 8 to accelerate this algorithm.

9.3.2 Mean-square value function error

An alternative mean-square error is defined in terms of the value function:

$$ \theta^* = \arg\min_\theta \|h^\theta - h\| \tag{9.33} $$

in which the choice of norm is part of the design of the algorithm. Most common is

$$ \|h^\theta - h\|_\pi^2 \overset{\text{def}}{=} \sum_{x\in\mathsf{X}} (h^\theta(x) - h(x))^2\pi(x) \tag{9.34} $$

in which $\pi$ is the steady-state pmf for $X$.

It is a remarkable fact that this loss function can be minimized without observations of $\{h(X(k))\}$. One class of algorithms for this purpose is TD(1)-learning. The algorithm is defined below, and conditions under which it solves (9.33) are presented in Thm. 9.7.


9.3.3 Projected Bellman error 


Assumed given is a $d$-dimensional stochastic process $\zeta$ known as the sequence of eligibility vectors. The goal is to obtain the vector $\theta^* \in \mathbb{R}^d$ that solves the Galerkin relaxation of (9.8). The smoothing property of conditional expectation gives

$$ 0 = E\bigl[\{-h^{\theta^*}(X(k)) + c(X(k)) + \gamma h^{\theta^*}(X(k+1))\}\zeta_k(i)\bigr], \qquad 1 \le i \le d. \tag{9.35} $$

It is usually assumed that the expectation is in steady state (so that $X(k)$ is distributed according to $\pi$). If $h = h^{\theta^\circ}$ for some $\theta^\circ \in \mathbb{R}^d$, and if the solution to (9.35) is unique, then the Galerkin approach will yield the exact solution $h$.


9.4 TD(λ) Learning 


The goal of this algorithm is to solve (9.35) for a particular choice of eligibility vector. We begin 
with a special case for which there is a rich supporting theory. 
9.4.1 Linear function class 


In TD(λ) learning with linear function approximation, the eligibility vectors are defined by passing $\{\psi(X(n))\}$ through a first-order low-pass filter:

$$ \zeta_{n+1} = \lambda\gamma\zeta_n + \psi(X(n+1)), \qquad n \ge 0 \tag{9.36} $$

It is always assumed that $\lambda \in [0,1]$.

TD(λ) algorithm
For initialization $\theta_0, \zeta_0 \in \mathbb{R}^d$, the sequence of estimates is defined recursively:

$$
\begin{aligned}
\theta_{n+1} &= \theta_n + \alpha_{n+1}\zeta_n D_{n+1} \\
D_{n+1} &= \bigl(-h^\theta(X(n)) + c(X(n)) + \gamma h^\theta(X(n+1))\bigr)\big|_{\theta=\theta_n} \\
\zeta_{n+1} &= \lambda\gamma\zeta_n + \psi(X(n+1))
\end{aligned}
\tag{9.37}
$$

The random variable $D_{n+1}$ is the temporal difference introduced in Section 9.3: using the definition (9.26), we have simplified notation via

$$ D_{n+1} = D^{\theta_n}_{n+1} $$

The proofs of Thm. 9.7 and Prop. 9.8 are postponed to Section 9.9.2.
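The recursion (9.37) can be sketched in a few lines. The example below is an illustration under stated assumptions, not a reference implementation: the three-state chain, the cost, the tabular basis, and the step-size rule $\alpha_{n+1} = g/(n+g)$ are all hypothetical choices, and $\zeta_0 = \psi(X(0))$ is one admissible initialization. With a tabular basis the value function lies in the function class, so the estimate can be compared with the exact solution of $h = c + \gamma P h$.

```python
import numpy as np

def simulate_chain(P, n_steps, rng, x0=0):
    """Sample path X(0), ..., X(n_steps) of a finite-state Markov chain."""
    cdf = np.cumsum(P, axis=1)
    X = np.empty(n_steps + 1, dtype=int)
    X[0] = x0
    uniforms = rng.random(n_steps)
    for n in range(n_steps):
        nxt = np.searchsorted(cdf[X[n]], uniforms[n])
        X[n + 1] = min(int(nxt), P.shape[0] - 1)    # guard against round-off in the cdf
    return X

def td_lambda(X, c, psi, gamma, lam, g=10.0):
    """TD(lambda) recursion (9.37) for the linear architecture h^theta = theta^T psi."""
    theta = np.zeros(psi.shape[1])
    zeta = psi[X[0]].copy()                          # zeta_0 (an assumed initialization)
    for n in range(len(X) - 1):
        x, x_next = X[n], X[n + 1]
        D = c[x] + theta @ (gamma * psi[x_next] - psi[x])   # temporal difference D_{n+1}
        theta = theta + (g / (n + g)) * zeta * D            # assumed step size g/(n+g)
        zeta = lam * gamma * zeta + psi[x_next]             # eligibility update (9.36)
    return theta

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]])
c = np.array([0.0, 1.0, 2.0])
gamma, lam = 0.5, 0.5
psi = np.eye(3)                                      # tabular basis: row x is psi(x)

X = simulate_chain(P, 50_000, rng)
theta = td_lambda(X, c, psi, gamma, lam)
h_exact = np.linalg.solve(np.eye(3) - gamma * P, c)  # exact discounted-cost value function
```

The step-size constant `g` is chosen large enough that the slowest mode of the mean flow is not the bottleneck; smaller values typically give a visibly slower transient.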


Theorem 9.7. Suppose that $\theta^*$ solves (9.35), where the expectation is in steady state, and the eligibility vector is defined using TD(λ) with linear function approximation (9.27). The solution has the following interpretation for two choices of $\lambda$:

(i) $\lambda = 0$: In the notation of (9.19),

$$ \widehat{E}[D^{\theta^*}_{n+1} \mid Y_n] = 0, $$

with $Y_n = \psi(X(n))$ and $D^{\theta^*}_{n+1} = -h^{\theta^*}(X(n)) + c(X(n)) + \gamma h^{\theta^*}(X(n+1))$.

(ii) $\lambda = 1$: $\theta^*$ solves (9.33), with norm (9.34). □


Subject to (9.27), we have

$$ D_{n+1} = c(X(n)) + [\gamma\psi(X(n+1)) - \psi(X(n))]^\top\theta_n $$

and consequently (9.37) can be placed in the form of the linear SA recursion (8.53b), with

$$ A_{n+1} = \zeta_n[\gamma\psi(X(n+1)) - \psi(X(n))]^\top, \qquad b_{n+1} = -\zeta_n c(X(n)) \tag{9.38} $$

Let $A = E[A_n]$ and $b = E[b_n]$, where the expectations are in steady state. If $A$ is invertible, then $\theta^* = A^{-1}b$ is the unique solution to (9.35).

Prop. 9.8 (i) tells us that the matrix $A$ is invertible and the TD(λ) algorithm is consistent under linear independence. Part (ii) is of interest when we come to the average-cost setting.

The definition of linear independence remains the same when $\psi$ is a function of $x$ only, and we continue to denote as in (9.24):

$$ R^\psi = E_\pi[\psi(X(n))\psi(X(n))^\top] \tag{9.39} $$

Let $\Sigma^\psi$ denote the steady-state covariance of the basis:

$$ \Sigma^\psi = R^\psi - \bar\psi\bar\psi^\top \tag{9.40} $$

where $\bar\psi = E_\pi[\psi(X(n))]$.


Proposition 9.8. The steady-state mean of the matrix $A_n$ defined in (9.38) satisfies the following:

(i) If the linear independence condition (9.24) holds, then $A$ is Hurwitz, and $\theta^* = A^{-1}b$ for any $\gamma \in [0,1)$ and $\lambda \in [0,1]$.

(ii) If $\Sigma^\psi > 0$ and $\Phi$ is aperiodic, then $A$ is Hurwitz for $\gamma = 1$ and $\lambda < 1$. □


The following representation of $A$ is required in the proofs of Thm. 9.7 and Prop. 9.8. See (9.23) to recall the definition of the autocorrelation sequence $\{R(i)\}$.

Lemma 9.9. For any $\gamma \in [0,1)$, $A = -R^\psi$ if $\lambda = 1$, and $A = -R(0) + \gamma R(1)^\top$ if $\lambda = 0$. Otherwise,

$$ A = -R(0) + (\lambda^{-1} - 1)\sum_{i=1}^\infty (\lambda\gamma)^i R(i)^\top \tag{9.41} $$

with autocorrelation sequence $\{R(i)\}$ defined in (9.23).

Proof. We start with the steady-state representation of the eligibility vector:

$$ \zeta_n = \sum_{i=0}^\infty (\lambda\gamma)^i\psi(X(n-i)) $$

Applying (9.38) then gives

$$
\begin{aligned}
A &= E_\pi\bigl[\zeta_n[\gamma\psi(X(n+1)) - \psi(X(n))]^\top\bigr] \\
&= \sum_{i=0}^\infty (\lambda\gamma)^i E_\pi\bigl[\psi(X(n-i))\bigl(-\psi(X(n)) + \gamma\psi(X(n+1))\bigr)^\top\bigr] \\
&= \sum_{i=0}^\infty (\lambda\gamma)^i\{-R(-i) + \gamma R(-i-1)\}
\end{aligned}
$$

The representation (9.41) follows, using $R(-i) = R(i)^\top$. □


Although TD(λ) is convergent, the mean-square error may converge at a rate far slower than optimal. The optimal $O(1/n)$ rate can be achieved by using one of the techniques from Chapter 8: $\alpha_n = g/n$ with $g > 0$ sufficiently large, the use of Polyak-Ruppert averaging (8.10), or an appropriate matrix gain. One instance of the last approach is the Stochastic Newton-Raphson algorithm (8.54), which is known as LSTD(λ) in the present context.

LSTD(λ)

With initialization $\theta_0, \zeta_0 \in \mathbb{R}^d$ and $\widehat{A}_0 \in \mathbb{R}^{d\times d}$,

$$ \theta_{n+1} = \theta_n - \alpha_{n+1}\widehat{A}_{n+1}^{-1}\zeta_n D_{n+1} \tag{9.42a} $$
$$ D_{n+1} = c(X(n)) + [\gamma\psi(X(n+1)) - \psi(X(n))]^\top\theta_n \tag{9.42b} $$
$$ \zeta_{n+1} = \lambda\gamma\zeta_n + \psi(X(n+1)) \tag{9.42c} $$
$$ \widehat{A}_{n+1} = \widehat{A}_n + \alpha_{n+1}[A_{n+1} - \widehat{A}_n] \tag{9.42d} $$
$$ A_{n+1} = \zeta_n[\gamma\psi(X(n+1)) - \psi(X(n))]^\top \tag{9.42e} $$

Prop. 8.8 can be applied, and suggests the simpler Monte-Carlo implementation: after gathering all the data up to time $N$, define $\theta^{\text{LSTD}}_N = \widehat{A}_N^{-1}\widehat{b}_N$, where

$$ \widehat{A}_N = \frac{1}{N}\sum_{k=0}^{N-1} A_{k+1}, \qquad \widehat{b}_N = -\frac{1}{N}\sum_{k=0}^{N-1}\zeta_k c(X(k)) $$
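The Monte-Carlo form of LSTD(λ) amounts to averaging the matrices and vectors in (9.38) along a sample path and solving one linear system. The following sketch assumes a hypothetical three-state chain and the two-dimensional basis $\psi(x) = (1, x)^\top$; for this particular example the value function happens to be affine in $x$, so it lies in the span of the basis. The function name `lstd_batch` is an illustration only.

```python
import numpy as np

def lstd_batch(X, c, psi, gamma, lam):
    """Batch LSTD(lambda): average A_{k+1} and b_{k+1} from (9.38), then solve A theta = b."""
    d = psi.shape[1]
    A_hat, b_hat = np.zeros((d, d)), np.zeros(d)
    zeta = psi[X[0]].copy()                          # zeta_0 (an assumed initialization)
    N = len(X) - 1
    for k in range(N):
        x, x_next = X[k], X[k + 1]
        A_hat += np.outer(zeta, gamma * psi[x_next] - psi[x]) / N
        b_hat += -zeta * c[x] / N
        zeta = lam * gamma * zeta + psi[x_next]      # eligibility update (9.36)
    return np.linalg.solve(A_hat, b_hat)

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]])
c = np.array([0.0, 1.0, 2.0])
gamma, lam = 0.5, 0.5
psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # row x is psi(x) = (1, x)

path = [0]
for _ in range(20_000):
    path.append(int(rng.choice(3, p=P[path[-1]])))    # simulate the chain
theta = lstd_batch(np.array(path), c, psi, gamma, lam)
h_exact = np.linalg.solve(np.eye(3) - gamma * P, c)   # exact value function for comparison
h_lstd = psi @ theta
```

No step-size sequence appears here, which is one practical attraction of the batch form; the cost is the storage and inversion of a $d \times d$ matrix.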
9.4.2 Nonlinear parameterizations

Suppose $\{h^\theta : \theta \in \mathbb{R}^d\}$ are not linear functions of $\theta$, but are differentiable. A generalization of the foregoing is based on the definition

$$ \psi_i(x;\theta) = \frac{\partial}{\partial\theta_i}h^\theta(x). $$

The temporal difference and eligibility sequence are redefined as follows:

$$ D_{n+1} = c(X(n)) + \gamma h^{\theta_n}(X(n+1)) - h^{\theta_n}(X(n)) \tag{9.43a} $$
$$ \zeta_{n+1} = \lambda\gamma\zeta_n + \psi(X(n+1);\theta_n), \qquad n \ge 0. \tag{9.43b} $$


The TD(λ) or LSTD(λ) algorithms can be defined using these definitions, but convergence theory is lacking.

The Zap TD(λ) algorithm is a potential alternative, which will be convergent under mild conditions. However, it is useful to go back to basics, and ask if there is good motivation for the choice of eligibility vector (9.43b).

If the algorithm is convergent, then the limit $\theta^*$ is expected to solve

$$ 0 = E\bigl[\bigl(c(X(n)) + \gamma h^{\theta^*}(X(n+1)) - h^{\theta^*}(X(n))\bigr)\zeta^{\theta^*}_n\bigr] \tag{9.44} $$

where $\zeta^{\theta^*}_{n+1} = \lambda\gamma\zeta^{\theta^*}_n + \psi(X(n+1);\theta^*)$, $n \ge 0$, and the expectation in (9.44) is taken with respect to the joint stationary process $(X, \zeta^{\theta^*})$. The fixed point equation (9.44) no longer has an interpretation as a Galerkin relaxation when the eligibility vector depends upon the parameter $\theta$. It may be best to modify the definition of the loss function so that the solution is more easily understood. Examples include convex Q-learning and actor-critic methods.


9.5 Return to the Q function 


In this section and the next it is assumed that the input is defined by a (possibly randomized) stationary policy $\check\phi$. For the sake of analysis, it is assumed throughout that $\Phi = \{\Phi(k) = (X(k),U(k)) : k \ge 0\}$ is uni-chain, with unique invariant pmf given by (9.3b).


9.5.1 Exploration


It is time to discuss how to define $\check\phi$ when our true goal is to estimate a value function for a deterministic policy $\phi$. It may be useful to construct the randomized policy so that it can be regarded as an “ε-perturbation” of $\phi$, where $\varepsilon$ is a positive but small constant.

One approach is to start with any randomized policy $\check\phi^0$. This might be completely random, in the sense that $\check\phi^0(u \mid x)$ does not depend upon $x$. The ε-approximation is then defined as follows:

$$ \check\phi(u \mid x) = (1-\varepsilon)\mathbb{1}\{u = \phi(x)\} + \varepsilon\check\phi^0(u \mid x) \tag{9.45} $$

This is implemented using an i.i.d. Bernoulli sequence $\{J_k\}$ with parameter $\varepsilon$, so that $P\{J_k = 1\} = \varepsilon$. The policy $\phi$ is applied at time $k$ if and only if $J_k = 0$; otherwise the input is determined by $\check\phi^0$.

An alternative approach is available in the context of policy iteration, as surveyed in Section 9.1. Suppose that $\phi$ is itself defined as a minimizer: for a function $G\colon \mathsf{X}\times\mathsf{U} \to \mathbb{R}$,

$$ \phi(x) \in \arg\min_u G(x,u) = \arg\max_u\{-G(x,u)\} $$

For given $\varepsilon > 0$ the soft-max is defined as

$$ \varepsilon\log\Bigl\{\sum_u \exp(-G(x,u)/\varepsilon)\Bigr\} $$

As $\varepsilon \downarrow 0$, this is convergent to $\max_u\{-G(x,u)\}$.
This motivates a class of randomized policies:

Gibbs Policy. Given $G\colon \mathsf{X}\times\mathsf{U} \to \mathbb{R}$ and $\varepsilon > 0$,

$$ \check\phi^\varepsilon(u \mid x) \overset{\text{def}}{=} \frac{1}{\kappa^\varepsilon(x)}\exp(-G(x,u)/\varepsilon) \tag{9.46a} $$

with $\kappa^\varepsilon$ the normalizing constant, defined so that $\check\phi^\varepsilon(\,\cdot \mid x)$ is a pmf on $\mathsf{U}$ for each $x$:

$$ \kappa^\varepsilon(x) = \sum_u \exp(-G(x,u)/\varepsilon) \tag{9.46b} $$

The parameter $\varepsilon$ is called the temperature, justified because the policy is highly random if $\varepsilon > 0$ is large (high entropy conjures the image of boiling water), while the policy typically becomes deterministic (freezes) as $\varepsilon$ approaches zero. In particular, if $G$ has a unique minimizer for each $x$ then

$$ \lim_{\varepsilon\downarrow 0}\check\phi^\varepsilon(u \mid x) = \mathbb{1}\{\phi(x) = u\} $$
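The Gibbs policy (9.46a) and the soft-max are simple to compute once care is taken with overflow at small temperature. The sketch below subtracts the row-wise extreme value before exponentiating, which leaves both quantities unchanged; the table `G` and the function names are hypothetical.

```python
import numpy as np

def gibbs_policy(G, eps):
    """Gibbs policy (9.46a): row x is the pmf proportional to exp(-G(x, u) / eps)."""
    shifted = -(G - G.min(axis=1, keepdims=True)) / eps   # shift by the row minimum for stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)   # division by kappa^eps(x)

def soft_max(G, eps):
    """eps * log(sum_u exp(-G(x, u) / eps)), computed in a numerically stable way."""
    m = (-G).max(axis=1)
    return m + eps * np.log(np.exp((-G - m[:, None]) / eps).sum(axis=1))

G = np.array([[1.0, 2.0, 3.0],
              [0.5, 0.1, 0.9]])          # hypothetical table G(x, u): 2 states, 3 actions
cold = gibbs_policy(G, 1e-3)             # nearly deterministic ("frozen")
hot = gibbs_policy(G, 100.0)             # nearly uniform
soft = soft_max(G, 1e-3)                 # close to max_u {-G(x, u)}
```

At low temperature each row of `cold` concentrates on the minimizer of $G(x,\,\cdot\,)$, and at high temperature each row of `hot` is close to the uniform pmf.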


9.5.2 On and off algorithms


Our main motivation for value function approximation is in application to approximate policy iteration, so we consider the fixed policy Q-function (9.6b):

$$ Q(x,u) = \sum_{k=0}^\infty \gamma^k E[c(\Phi(k)) \mid X(0) = x,\ U(0) = u] $$

When the policy $\check\phi$ is randomized we require the definition (9.5b): $Q(x) = \sum_u Q(x,u)\check\phi(u \mid x)$. The fixed point equations (9.7) continue to hold, along with other expected relations:

Proposition 9.10. The value functions $h$ and $Q$ obtained from a randomized policy $\check\phi$ satisfy

$$ h(x) = c_{\check\phi}(x) + \gamma\sum_{x'} P_{\check\phi}(x,x')h(x'), \qquad Q(x,u) = c(x,u) + \gamma\sum_{x'} P_u(x,x')Q(x') \tag{9.47} $$

with $c_{\check\phi}(x) = \sum_u c(x,u)\check\phi(u \mid x)$, and $P_{\check\phi}$ defined in (7.12). The two are related via

$$ h(x) = Q(x) \quad\text{and}\quad Q(x,u) = c(x,u) + \gamma\sum_{x'} P_u(x,x')h(x'), \quad\text{for each } x, u. \tag{9.48} $$


We obtain two model-free representations of the fixed point equation for $Q$ in (9.47), distinguished by the choice of input for learning. The following dichotomy was described in Section 5.3.1 for deterministic control systems:

▷ On-policy method: If $U$ is chosen according to the policy $\check\phi$ then

$$ Q(\Phi(k)) = c(\Phi(k)) + \gamma E[Q(\Phi(k+1)) \mid \mathcal{F}_k] \tag{9.49} $$

▷ Off-policy method: If $U$ is any admissible input then the representation must be modified:

$$ Q(\Phi(k)) = c(\Phi(k)) + \gamma E[Q(X(k+1)) \mid \mathcal{F}_k] \tag{9.50} $$

where $\mathcal{F}_k$ represents the history $\{\Phi(0),\dots,\Phi(k)\}$.

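Consider now application of these representations for function approximation within a parametrized family $\{H^\theta : \theta \in \mathbb{R}^d\}$. The on-policy TD-learning algorithms surveyed in the previous pages are directly applicable on recognizing that $\Phi$ is a Markov chain, and that $Q$ is its discounted-cost value function. For example, here is the TD(λ) algorithm for a linear function approximation architecture: $H^\theta(x,u) = \theta^\top\psi(x,u)$ with $\psi\colon \mathsf{X}\times\mathsf{U} \to \mathbb{R}^d$:

TD(λ) algorithm (on-policy for Q)
For initialization $\theta_0, \zeta_0 \in \mathbb{R}^d$, the sequence of estimates is defined recursively:

$$
\begin{aligned}
\theta_{n+1} &= \theta_n + \alpha_{n+1}\zeta_n D_{n+1} \\
D_{n+1} &= \bigl(-H^\theta(\Phi(n)) + c_n + \gamma H^\theta(\Phi(n+1))\bigr)\big|_{\theta=\theta_n} \\
\zeta_{n+1} &= \lambda\gamma\zeta_n + \psi_{(n+1)}, \qquad \psi_{(n+1)} \overset{\text{def}}{=} \psi(\Phi(n+1)), \qquad c_n \overset{\text{def}}{=} c(\Phi(n))
\end{aligned}
\tag{9.51}
$$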


The conclusions of Thm. 9.7 hold because we are simply replacing $X$ with $\Phi$ in the algorithm (9.51). The theorem is stated in this new notation for ease of reference. Equation (9.35) in this notation becomes

$$ 0 = E\bigl[\{-H^{\theta^*}(\Phi(k)) + c(\Phi(k)) + \gamma H^{\theta^*}(\Phi(k+1))\}\zeta_k(i)\bigr], \qquad 1 \le i \le d. \tag{9.52} $$

Theorem 9.11. Suppose that $\theta^*$ solves (9.52), where the expectation is in steady state, and the eligibility vector is defined using TD(λ) with linear function approximation. The solution has the following interpretation for two choices of $\lambda$:

(i) $\lambda = 0$: In the notation of (9.19),

$$ \widehat{E}[D^{\theta^*}_{n+1} \mid Y_n] = 0, $$

with $Y_n = \psi(\Phi(n)) = \psi_{(n)}$ and $D^{\theta^*}_{n+1} = -H^{\theta^*}(\Phi(n)) + c_n + \gamma H^{\theta^*}(\Phi(n+1))$.

(ii) $\lambda = 1$: $\theta^*$ solves

$$ \theta^* = \arg\min_\theta \|H^\theta - Q\|_\varpi^2, \qquad \|H^\theta - Q\|_\varpi^2 \overset{\text{def}}{=} \sum_{x\in\mathsf{X},\,u\in\mathsf{U}} (H^\theta(x,u) - Q(x,u))^2\varpi(x,u) $$


The off-policy setting is far more practical in many cases. In particular, for application to policy iteration the policies $\{\phi_n\}$ obtained through (9.9b) are deterministic. However, in most cases we require a randomized policy to ensure sufficient exploration.

TD(λ) algorithm (off-policy for Q)
For initialization $\theta_0, \zeta_0 \in \mathbb{R}^d$, the sequence of estimates is defined recursively:

$$
\begin{aligned}
\theta_{n+1} &= \theta_n + \alpha_{n+1}\zeta_n D_{n+1} \\
D_{n+1} &= \bigl(-H^\theta(\Phi(n)) + c_n + \gamma H^\theta(X(n+1))\bigr)\big|_{\theta=\theta_n} \\
\zeta_{n+1} &= \lambda\gamma\zeta_n + \psi_{(n+1)}, \qquad \psi_{(n+1)} = \psi(\Phi(n+1)), \qquad c_n = c(\Phi(n))
\end{aligned}
\tag{9.53}
$$


A critical difference is in the form of the temporal difference term: to obtain $D_{n+1}$ it is necessary to compute

$$ H^\theta(X(n+1)) = \sum_u H^\theta(x,u)\check\phi(u \mid x)\Big|_{x=X(n+1)} \tag{9.54} $$

If this is too complex, an alternative is split sampling (recall (6.47)):

$$ D_{n+1} = \bigl(-H^\theta(\Phi(n)) + c_n + \gamma H^\theta(X(n+1), U'_{n+1})\bigr)\big|_{\theta=\theta_n} \tag{9.55} $$

in which $U'_{n+1}$ is a random variable that is conditionally independent of $\mathcal{F}_{n+1}$ given $X(n+1)$, with conditional pmf defined as follows:

$$ P\{U'_{n+1} = u \mid \mathcal{F}_{n+1}\} = \check\phi(u \mid x') \quad\text{when } x' = X(n+1) $$

The update equation is unchanged: $\theta_{n+1} = \theta_n + \alpha_{n+1}\zeta_n D_{n+1}$. Theory for convergence using (9.55) is unchanged, but the algorithm will have higher variance (the covariance $\Sigma_\Delta$ appearing in Section 8.1.5 will be larger).

The geometry behind the proof of Thm. 9.7 breaks down in the off-policy setting. The off-policy algorithm is intended to solve

$$ 0 = E_\varpi\bigl[\bigl(-H^\theta(\Phi(n)) + c_n + \gamma H^\theta(X(n+1), U'_{n+1})\bigr)\zeta_n\bigr], $$

which can be expressed as the linear equation $A\theta = b$ with

$$ A = E_\varpi\bigl[\zeta_n\bigl(-\psi(\Phi(n)) + \gamma\psi(X(n+1))\bigr)^\top\bigr], \qquad b = -E_\varpi[c_n\zeta_n], \qquad\text{and}\quad \psi(x) = \sum_u \psi(x,u)\check\phi(u \mid x). $$

We don’t even know if $A$ is invertible, so the linear equation may not have a solution.
The following result gives us some hope. The proposition looks at the entire range of $\gamma \in [0,1]$, so we write $A_\gamma$ to emphasize the dependence.

Proposition 9.12. Suppose that the linear independence condition (9.24) holds. We then have:

(i) For $\lambda = 0$, the matrix $A_\gamma$ is invertible for all but at most $d$ values of $\gamma \in [0,1]$.

(ii) For each $\lambda \in (0,1)$, the matrix $A_\gamma$ is invertible for all but a finite number of values of $\gamma \in [0,1]$.

(iii) The matrix $A_\gamma$ with $\lambda = 1$ is invertible for all but a finite number of values of $\gamma \in [0,1-\delta]$ for each $\delta > 0$. □

Prop. 9.12 does not claim that the matrix $A_\gamma$ is Hurwitz, so stability of off-policy TD(λ) is not resolved. The proposition justifies application of LSTD(λ) in the off-policy setting since it is highly likely that $A_\gamma$ is full rank.


9.5.3 Relative TD(λ)


We can expect numerical challenges with TD(λ) algorithms when $\gamma$ is close to unity. The reason comes from the fact that either of the value functions defined in (9.6) is typically unbounded as $\gamma \uparrow 1$. Here we exploit the fact that the value functions are large only because of an additive constant.

This structure is illustrated via an example in Exercise 6.6 (b). For the general case, consider the fixed policy Q-function:

$$ Q_\gamma(z) = \sum_{k=0}^\infty \gamma^k E[\eta + \tilde c_k \mid \Phi(0) = z] = \frac{\eta}{1-\gamma} + \sum_{k=0}^\infty \gamma^k E[\tilde c_k \mid \Phi(0) = z] $$

with $\tilde c_k = c_k - \eta$. For $\gamma \sim 1$, the right hand side is a very large constant, plus a term that approximates the solution to Poisson’s equation given in Thm. 6.3 (i) (for the Markov chain $\Phi$ rather than $X$). Consequently,

$$ \lim_{\gamma\uparrow 1}\Bigl\{Q_\gamma(z) - \frac{\eta}{1-\gamma}\Bigr\} = \tilde Q(z), \qquad z \in \mathsf{X}\times\mathsf{U} $$

where $\tilde Q$ solves Poisson’s equation:

$$ E[\tilde c_k + \tilde Q(\Phi(k+1)) \mid \Phi(k) = z] = \tilde Q(z) \tag{9.56} $$

The function $\tilde Q_\gamma = Q_\gamma - \eta/(1-\gamma)$ solves the DP equation

$$ c + \gamma P\tilde Q_\gamma = \tilde Q_\gamma + \eta \tag{9.57} $$

We might use this to define a temporal difference sequence. In application to policy improvement, the new policy can be represented in terms of $\tilde Q_\gamma$ instead of $Q_\gamma$:

$$ \phi^+(x) \in \arg\min_u Q_\gamma(x,u) = \arg\min_u \tilde Q_\gamma(x,u) $$

To avoid estimation of $\eta$ we opt for an alternative, known as the (fixed policy) relative dynamic programming equation:

$$ c + \gamma PH = H + \delta\langle\mu, H\rangle \tag{9.58} $$

where $\delta > 0$ is a positive scalar, $\mu\colon \mathsf{X}\times\mathsf{U} \to [0,1]$ is a pmf (both design choices), and $\langle\mu, H\rangle = \sum_{x,u}\mu(x,u)H(x,u)$.
In the notation of Section 6.3 the DP equation (9.58) can be expressed

$$ [I - (\gamma P - \delta\,\mathbf{1}\otimes\mu)]H = c $$

Lemma 9.13 tells us that $H$ can be expressed as a power series under mild assumptions on $P$ and $\delta$. More important is that $H = Q + \text{constant}$, so it is just as valuable as $Q$ for application in policy improvement.


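Lemma 9.13. Suppose that $\Phi$ is uni-chain. Then, for each $\gamma \in [0,1)$,

(i) The eigenvalues of the matrix $\gamma P - \delta\,\mathbf{1}\otimes\mu$ coincide with those of $\gamma P$, except for $\lambda_1 = \gamma$ which is moved to $\gamma - \delta$.

(ii) Provided $|\gamma - \delta| < 1$ we have

$$ H = [I - (\gamma P - \delta\,\mathbf{1}\otimes\mu)]^{-1}c = \sum_{n=0}^\infty (\gamma P - \delta\,\mathbf{1}\otimes\mu)^n c \tag{9.59} $$

The sum is convergent because $(\gamma P - \delta\,\mathbf{1}\otimes\mu)^n \to 0$ geometrically quickly as $n \to \infty$.

(iii) $H = Q - k$, with

$$ k = \frac{\delta}{1+\delta-\gamma}\langle\mu, Q\rangle \tag{9.60} $$

The big step in establishing part (iii) is an application of the Matrix Inversion Lemma (A.1) to obtain the representation

$$ [I - (\gamma P - \delta\,\mathbf{1}\otimes\mu)]^{-1} = [I - \gamma P]^{-1} - \frac{\delta}{1+\delta-\gamma}(\mathbf{1}\otimes\mu)[I - \gamma P]^{-1} \tag{9.61} $$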
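The conclusions of Lemma 9.13 are easy to confirm numerically on a small example. In the sketch below the transition matrix `P` of $\Phi$, the cost `c`, and the design choices `delta` and `mu` are all randomly generated or hypothetical; the code solves the discounted and relative DP equations directly and compares $H$ with $Q - k$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                        # number of state-action pairs (hypothetical)
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # transition matrix of Phi under a fixed policy
c = rng.random(n)                            # cost on state-action pairs
gamma, delta = 0.99, 0.5
mu = np.full(n, 1.0 / n)                     # pmf mu (a design choice)
one_mu = np.outer(np.ones(n), mu)            # the matrix 1 (x) mu

Q = np.linalg.solve(np.eye(n) - gamma * P, c)                   # Q = c + gamma P Q
H = np.linalg.solve(np.eye(n) - gamma * P + delta * one_mu, c)  # relative DP equation (9.58)
k = delta * (mu @ Q) / (1.0 + delta - gamma)                    # constant in (9.60)

residual = c + gamma * P @ H - H - delta * (mu @ H)             # should vanish by (9.58)
eigs = np.linalg.eigvals(gamma * P - delta * one_mu)            # contains gamma - delta
```

With $\gamma = 0.99$ the entries of `Q` are of order $1/(1-\gamma)$, while those of `H` remain of moderate size, which is the motivation for the relative formulation.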
The relative DP equation (9.58) has the probabilistic interpretation

$$ E\bigl[-H(\Phi(n)) - \delta\langle\mu, H\rangle + c_n + \gamma H(X(n+1)) \bigm| \mathcal{F}_n\bigr] = 0, \qquad n \ge 0 \tag{9.62} $$

where $H(x) = \sum_u H(x,u)\check\phi(u \mid x)$ as in (9.54). Suppose that $\{H^\theta : \theta \in \mathbb{R}^d\}$ is a parameterized family of functions on $\mathsf{X}\times\mathsf{U}$. The goal in relative TD(λ) learning is to find $\theta^*$ satisfying $\bar f(\theta^*) = 0$, with

$$ \bar f(\theta) = E_\varpi\bigl[\{-H^\theta(\Phi(n)) - \delta\langle\mu, H^\theta\rangle + c_n + \gamma H^\theta(X(n+1))\}\zeta_n\bigr] \tag{9.63} $$

where $\{\zeta_n\}$ are the eligibility vectors, defined via (9.53) if the parameterization is linear. An SA algorithm to estimate $\theta^*$ is described below in this special case:

Relative TD(λ) algorithm (off-policy)
For initialization $\theta_0, \zeta_0 \in \mathbb{R}^d$, the sequence of estimates is defined recursively:

$$
\begin{aligned}
\theta_{n+1} &= \theta_n + \alpha_{n+1}\zeta_n D_{n+1} \\
D_{n+1} &= \bigl(-H^\theta(\Phi(n)) - \delta\langle\mu, H^\theta\rangle + c_n + \gamma H^\theta(X(n+1))\bigr)\big|_{\theta=\theta_n} \\
\zeta_{n+1} &= \lambda\gamma\zeta_n + \psi_{(n+1)}, \qquad \psi_{(n+1)} = \psi(\Phi(n+1))
\end{aligned}
\tag{9.64}
$$

This is a linear SA recursion based on the ODE $\tfrac{d}{dt}\vartheta = \bar f(\vartheta) = A\vartheta - b$ in which

$$ A = E_\varpi\bigl[\zeta_n\bigl(-\psi(\Phi(n)) - \delta\bar\psi^\mu + \gamma\psi(X(n+1))\bigr)^\top\bigr], \qquad b = -E_\varpi[c_n\zeta_n] $$

where $\bar\psi^\mu$ is a $d$-dimensional column vector, whose $i$th component is the mean $\langle\mu, \psi_i\rangle$. Conditions to ensure that $A$ is Hurwitz can be obtained in the on-policy setting.

A challenge remains for $\lambda = 1$ in which case $b = -E_\varpi[c_n\zeta_n]$ may be very large when $\gamma \sim 1$, and this suggests high variance in the algorithm. The regeneration technique introduced in Section 10.3 is one approach to obtain algorithms that are reliable when both $\gamma$ and $\lambda$ are close to unity.

An alternative is to abandon approximation of Q and turn to the advantage function. 


9.5.4 TD(λ) for advantage

The advantage function introduced in Section 9.1.3 admits the following equivalent forms:

Proposition 9.14. The following representations hold for the advantage function:

$$ V(x,u) = Q(x,u) - h(x) \tag{9.65a} $$
$$ \phantom{V(x,u)} = -h(x) + c(x,u) + \gamma E[h(X(k+1)) \mid \Phi(k) = (x,u)] \tag{9.65b} $$
$$ \phantom{V(x,u)} = c(x,u) - c_{\check\phi}(x) + \gamma\bigl\{E[h(X(k+1)) \mid \Phi(k) = (x,u)] - E[h(X(k+1)) \mid X(k) = x]\bigr\} \tag{9.65c} $$

for each $x \in \mathsf{X}$, $u \in \mathsf{U}$, and with $c_{\check\phi}(x) = \sum_u c(x,u)\check\phi(u \mid x)$. □

The proof is postponed to Section 9.9.1.

The motivation for the advantage function is in part the desire to reduce variance in the estimate 
of the function used in policy improvement. This brings a question, can we estimate V directly 
with reduced variance (as compared to estimating Q and h separately, and then subtracting)? It 
might not be surprising that the answer is yes, but full justification won’t come until Section 10.2. 
We provide here a heuristic and an algorithm. 

The heuristic: we have $E[V(\Phi(k)) \mid X(k)] = 0$ for each $k$, so it is reasonable to choose a function class for which the same holds for any approximation within this class. This is not difficult to arrange.

The following notation is used throughout Chapter 10:

$$ \psi(x) = \sum_u \psi(x,u)\check\phi(u \mid x) \quad\text{and}\quad \tilde\psi(x,u) = \psi(x,u) - \psi(x), \qquad x \in \mathsf{X},\ u \in \mathsf{U} \tag{9.66} $$

We search for an approximation within the function class

$$ \mathcal{H} = \{H^\theta = \theta^\top\tilde\psi : \theta \in \mathbb{R}^d\} \tag{9.67} $$

We have by definition $E[H(\Phi(k)) \mid X(k)] = 0$ for any $H \in \mathcal{H}$. With a bit more work, we learn in Section 10.2 that the best estimate of $V$ within $\mathcal{H}$ coincides with the best estimate of $Q$ within the same function class.

Stability of any algorithm based on this finding requires linear independence of the basis $\tilde\psi$.

For a deterministic policy we have $\tilde\psi(\Phi(k)) = 0$ for every $k$, so we are out of luck! Even for a randomized policy, the covariance matrix $\Sigma^{\tilde\psi}$ may be rank deficient even when $\Sigma^\psi$ is full rank. In this case the basis can be trimmed by writing $\Sigma^{\tilde\psi} = C^\top C$ where $C$ is an $m \times d$ matrix with rank $m < d$, and then replace $\tilde\psi$ with the $m$-dimensional basis,

$$ \tilde\psi^\circ = [CC^\top]^{-1}C\tilde\psi $$

The covariance of $\tilde\psi^\circ$ is the $m \times m$ identity matrix, and the function class is unchanged:

$$ \mathcal{H}^\circ = \{\omega^\top\tilde\psi^\circ : \omega \in \mathbb{R}^m\} = \mathcal{H} $$


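TD(λ) algorithm (on-policy for advantage)

For initialization $\omega_0, \zeta_0 \in \mathbb{R}^m$, the sequence of estimates is defined recursively:

$$
\begin{aligned}
\omega_{n+1} &= \omega_n + \alpha_{n+1}\zeta_n D_{n+1} \\
D_{n+1} &= \bigl(-H^\omega(\Phi(n)) + c_n + \gamma H^\omega(\Phi(n+1))\bigr)\big|_{\omega=\omega_n} \\
\zeta_{n+1} &= \lambda\gamma\zeta_n + \tilde\psi^\circ(\Phi(n+1)), \qquad H^\omega(\Phi(n)) = \omega^\top\tilde\psi^\circ(\Phi(n))
\end{aligned}
\tag{9.68}
$$

Is it a problem if $d = 10^4$ and $m = 1$? Just the opposite: complexity is reduced, and for $\lambda = 1$ the optimal estimate $\omega^* \in \mathbb{R}^m$ defines the optimal $L_2$ approximation of $V$. See Section 10.2 for details. The only remaining challenge may be computation of the policy-averaged basis $\psi(x)$ in (9.66). See the discussion surrounding (9.54) for a solution using split sampling.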
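The centering in (9.66) and the trimming step can be sketched numerically. The example below assumes a tabular basis on three states and two actions, together with a hypothetical pmf `pi` and randomized policy; the factor $C$ is obtained from an eigen-decomposition of the covariance, which is one of several ways to write $\Sigma^{\tilde\psi} = C^\top C$. Function names are illustrations only.

```python
import numpy as np

def centered_basis(psi, policy):
    """psi_tilde(x, u) = psi(x, u) - sum_u' psi(x, u') policy[x, u'], as in (9.66)."""
    psi_bar = np.einsum('xud,xu->xd', psi, policy)
    return psi - psi_bar[:, None, :]

def trim_basis(psi_tilde, varpi, tol=1e-10):
    """Return the m-dimensional basis [C C^T]^{-1} C psi_tilde, with Sigma = C^T C."""
    d = psi_tilde.shape[-1]
    flat = psi_tilde.reshape(-1, d)                       # one row per pair (x, u)
    Sigma = flat.T @ (varpi.reshape(-1, 1) * flat)        # steady-state covariance
    evals, evecs = np.linalg.eigh(Sigma)
    keep = evals > tol                                    # discard the null directions
    C = np.sqrt(evals[keep])[:, None] * evecs[:, keep].T  # m x d factor of Sigma
    trimmed = flat @ C.T @ np.linalg.inv(C @ C.T)
    return trimmed.reshape(psi_tilde.shape[:2] + (C.shape[0],))

nx, nu = 3, 2
psi = np.eye(nx * nu).reshape(nx, nu, nx * nu)            # tabular basis, d = 6
policy = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])   # hypothetical randomized policy
pi = np.array([0.5, 0.3, 0.2])                            # hypothetical steady-state pmf on X
varpi = pi[:, None] * policy                              # pmf on state-action pairs

psi_tilde = centered_basis(psi, policy)
trimmed = trim_basis(psi_tilde, varpi)
cond_mean = np.einsum('xum,xu->xm', trimmed, policy)      # conditional mean given x
cov = np.einsum('xum,xun,xu->mn', trimmed, trimmed, varpi)
```

In this example the six-dimensional centered basis has a covariance of rank three, so the trimmed basis has dimension $m = 3$, with identity covariance and zero conditional mean given the state.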


In the following sections we exit the fixed-policy setting, returning to (9.1). 


9.6 Watkins’ Q-Learning 


9.6.1 Optimal control essentials 
Two value functions are of interest to us here:

$$
\begin{aligned}
h^*(x) &= \min_{\boldsymbol{U}}\sum_{k=0}^\infty \gamma^k E[c(\Phi(k)) \mid X(0) = x] \\
Q^*(x,u) &= \min_{U(1),U(2),\dots}\sum_{k=0}^\infty \gamma^k E[c(\Phi(k)) \mid X(0) = x,\ U(0) = u]
\end{aligned}
\tag{9.69}
$$

where $\Phi(k) = (X(k),U(k))$, and in the optimal control setting we denote $Q^*(x) \overset{\text{def}}{=} \min_u Q^*(x,u)$, following the convention (9.5c). Given one value function, we have the other: for each $x$ and $u$,

$$ h^*(x) = Q^*(x) \quad\text{and}\quad Q^*(x,u) = c(x,u) + \gamma\sum_{x'} P_u(x,x')h^*(x') \tag{9.70} $$

The advantage function is defined as the difference $V^* = Q^* - h^*$, which is a function on $\mathsf{X}\times\mathsf{U}$ taking non-negative values: $\min_u V^*(x,u) = 0$ for each $x$.
Q-learning algorithms are often based on a Galerkin relaxation that mirrors TD(λ): given a parametrized family $\{H^\theta : \theta \in \mathbb{R}^d\}$, and a sequence of $d$-dimensional eligibility vectors $\{\zeta_n\}$, the goal is to find a solution $\theta^*$ to

$$ 0 = \bar f(\theta^*) = E\bigl[\{-H^\theta(\Phi(n)) + c_n + \gamma H^\theta(X(n+1))\}\zeta_n\bigr]\big|_{\theta=\theta^*} \tag{9.71} $$
The ODE method (8.4) leads quickly to an algorithm:
1. Formulate the goal as a root finding problem $\bar f(\theta^*) = 0$, with $\bar f$ defined in (9.71).
2. Refine the design of $\bar f$ to ensure that the associated ODE is globally asymptotically stable.
3. Is an Euler approximation appropriate? Is $\bar f$ Lipschitz continuous?
4. If step 2 is skipped (no modification is needed), and step 3 is answered in the affirmative, then we arrive at the SA algorithm:

$$ \theta_{n+1} = \theta_n + \alpha_{n+1}\{-H^{\theta_n}(\Phi(n)) + c(\Phi(n)) + \gamma H^{\theta_n}(X(n+1))\}\zeta_n \tag{9.72} $$

Q(0)-learning: the recursion (9.72) using $\zeta_n = \nabla_\theta H^\theta(\Phi(n))\big|_{\theta=\theta_n}$.

For a linear parameterization $H^\theta = \theta^\top\psi$, this gives $\zeta_n = \psi_{(n)}$.


In general, the ODE method fails at step 3 when the state space is not finite. An example is the LQG problem in which $\{H^\theta : \theta \in \mathbb{R}^d\}$ is a linearly parameterized family of quadratic functions on $\mathsf{X}\times\mathsf{U}$. In this case $H^\theta$ is rarely Lipschitz continuous as a function of $\theta$; see Exercise 9.4 for an example.

In the finite state space setting we can expect Lipschitz continuity, but conditions ensuring 
stability of the ODE are not easily verified. 


[Figure 9.1 panels: left, the 6-state MDP model with 18 state-action pairs; right, the 18 eigenvalues $\lambda_i$, $1 \le i \le 18$, shown for $\gamma = 0.8$ and $\gamma = 0.99$.]

Figure 9.1: Six-state directed graph for a finite state-action MDP example
9.6.2 Watkins’ algorithms


One great success story is found in the special case called tabular Q-learning, with basis previously defined in (5.10): $\psi_i(x,u) = \mathbb{1}\{(x,u) = (x^i,u^i)\}$ for each $i$, with $\mathsf{X}\times\mathsf{U} = \{(x^i,u^i) : 1 \le i \le d\}$. We continue to write $\theta \in \mathbb{R}^d$, but in the tabular setting $d = |\mathsf{X}|\times|\mathsf{U}|$, and the function class spans all possible functions on $\mathsf{X}\times\mathsf{U}$. For this architecture the stability theory of TD learning admits extension to the nonlinear SA algorithm (9.72).

A simple stochastic-shortest-path problem will be used to illustrate theory, in which the state space $\mathsf{X} = \{1,\dots,6\}$ coincides with the six nodes on the un-directed graph shown on the left hand side of Fig. 9.1. The input space coincides with the edges shown: $\mathsf{U} = \{e_{x,x'}\}$, $x, x' \in \mathsf{X}$. The controlled transition matrix is defined as follows: If $X(n) = x \in \mathsf{X}$, and $U(n) = e_{x,x'} \in \mathsf{U}$, then $X(n+1) = x'$ with probability 0.8, and with probability 0.2, the next state is randomly chosen between the other neighboring nodes. The goal is to reach the state $x^* = 6$ and maximize the time spent there. The cost function is designed with this goal in mind, and also a cost for movement from any node:


$$
c(x,u) = \begin{cases}
0 & u = e_{x,x},\ x \ne 6 \\
5 & u = e_{x,x'},\ x \ne x' \\
-100 & u = e_{6,6}
\end{cases}
$$


In the tabular setting it is common to write $H^n$ rather than $H^{\theta_n}$ (identifying the function approximation with the parameter). This is justified by the following representation: for each $i$ and $\theta$,

$$ H^\theta(x^i,u^i) = \sum_{j=1}^d \theta_j\psi_j(x^i,u^i) = \theta_i \tag{9.73} $$

where the final equality holds because of the tabular basis.
Two versions of tabular Q-learning are distinguished by the way in which samples of $\Phi = (X,U)$ are obtained.


Either of the following algorithms is commonly known as Watkins’ Q-learning. 


(i) Asynchronous: access to a single sample path of $\boldsymbol{\Phi}$.
$H^{n+1}(x,u) = H^n(x,u)$ if $\Phi(n) \neq (x,u)$, and otherwise

$$ H^{n+1}(x,u) = H^n(x,u) + \alpha_{n+1}\bigl[-H^n(x,u) + c(x,u) + \gamma\underline{H}^n(X(n+1))\bigr] $$

with $\underline{H}^n(x) = \min_u H^n(x,u)$. Asynchronous Q-learning can be implemented without a model: $Q^*$ is approximated based on observations of the system.

(ii) Synchronous: access to a simulator to generate from the conditional pmf $P_u(x,\cdot)$ an i.i.d. sequence $\boldsymbol{\Xi}$ on $\mathsf{X}^d$, with $\Xi^i_{n+1} \sim P_{u^i}(x^i,\cdot)$.
At iteration $n+1$ each entry of $H^{n+1}$ is updated: for $i = 1,\dots,d$,

$$ H^{n+1}(x^i,u^i) = H^n(x^i,u^i) + \alpha_{n+1}\bigl[-H^n(x^i,u^i) + c(x^i,u^i) + \gamma\underline{H}^n(\Xi^i_{n+1})\bigr] $$
In either case, given the estimate of the Q-function at iteration $n$, we obtain the $H^n$-greedy policy as any solution to

$$ \phi_n(x) \in \arg\min_u H^n(x,u), \quad x \in \mathsf{X} \tag{9.74} $$
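The asynchronous update and the greedy policy (9.74) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the dictionary representation of $H^n$, the function names, and the simplification that every state shares the same action list are choices made here for brevity.

```python
from typing import Dict, List, Tuple

QTable = Dict[Tuple[int, int], float]  # hypothetical container: (x, u) -> H(x, u)


def min_over_actions(H: QTable, x: int, actions: List[int]) -> float:
    """Minimum of H(x, u) over u (the underlined H in the text)."""
    return min(H[(x, u)] for u in actions)


def watkins_async_update(H: QTable, x: int, u: int, cost: float, x_next: int,
                         alpha: float, gamma: float, actions: List[int]) -> QTable:
    """One asynchronous update: only the visited entry (x, u) changes."""
    td = -H[(x, u)] + cost + gamma * min_over_actions(H, x_next, actions)
    H_new = dict(H)  # copy, so every other entry is carried over unchanged
    H_new[(x, u)] = H[(x, u)] + alpha * td
    return H_new


def greedy_policy(H: QTable, states: List[int], actions: List[int]) -> Dict[int, int]:
    """An H-greedy policy as in (9.74); ties go to the first minimizer."""
    return {x: min(actions, key=lambda u: H[(x, u)]) for x in states}
```

The update never touches the transition matrix, which is the sense in which the asynchronous algorithm is model-free: it needs only the observed triple (current pair, cost, next state).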


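This section restricts to the asynchronous case (only to avoid repetition of equations and concepts in the theoretical development). Given the definition of the basis, asynchronous Q-learning is expressed as Q(0)-learning:

$$ \begin{aligned} \theta_{n+1} &= \theta_n + \alpha_{n+1}D_{n+1}\zeta_n \\ D_{n+1} &= -H^n(\Phi(n)) + c_n + \gamma\underline{H}^n(X(n+1)) \\ \zeta_n &= \nabla_\theta\bigl[H^\theta(\Phi(n))\bigr]\big|_{\theta=\theta_n} = \psi_{(n)} \end{aligned} \tag{9.75} $$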


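The recursion (9.75) for the Q-learning algorithm can be written in a form similar to the linear recursion (8.53b). On denoting $\underline{\psi}_{(n+1)} = \psi(X(n+1), \phi_n(X(n+1)))$, with $\phi_n$ any $H^n$-greedy policy,

$$ \begin{aligned} \theta_{n+1} &= \theta_n + \alpha_{n+1}\bigl[A_{n+1}\theta_n - b_{n+1}\bigr] \\ A_{n+1} &= \psi_{(n)}\bigl[-\psi_{(n)} + \gamma\underline{\psi}_{(n+1)}\bigr]^T \\ b_{n+1} &= -c_n\psi_{(n)} \end{aligned} \tag{9.76} $$

This is not a linear SA algorithm since the policy $\phi_n$ depends upon $\theta_n$.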

It should be clear that the algorithm (9.76) can be implemented with an arbitrary basis. The 
restriction to the tabular setting is made in this section only because there is a complete and 
accessible stability theory for Watkins’ algorithms and refinements to come. 


9.6.3. Exploration 
It is common to employ the Gibbs policy (9.46a) using $G = H^n$:

$$ \mathsf{P}\{U(n) = u \mid X(n) = x\} = \frac{1}{\kappa^{\varepsilon,n}(x)}\exp\bigl(-H^n(x,u)/\varepsilon\bigr) \tag{9.77} $$

where $\kappa^{\varepsilon,n}(x)$ is a normalizing constant, and $\varepsilon > 0$ is typically fixed (but it too may depend upon $n$, possibly vanishing as $n \to \infty$). This is often highly successful, but avoided in the theoretical development of this chapter. While SA theory now has matured to the point where it can be applied to establish stability of Q-learning with a time-varying policy such as this, there is no space in this book for a proper treatment.
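For a single state $x$, the Gibbs policy amounts to a softmax of $-H^n(x,\cdot)/\varepsilon$. The sketch below is illustrative only; the function name and the list representation of one row of $H^n$ are assumptions made here, and subtracting the minimum before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math
from typing import List


def gibbs_policy(h_row: List[float], eps: float) -> List[float]:
    """Probabilities proportional to exp(-H(x, u) / eps) for one fixed state x."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    shift = min(h_row)  # shifting all entries cancels in the normalization
    weights = [math.exp(-(h - shift) / eps) for h in h_row]
    kappa = sum(weights)  # normalizing constant for this state
    return [w / kappa for w in weights]
```

Small values of `eps` concentrate the probability on the greedy input, while large values approach the uniform distribution.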

Instead we assume that $\boldsymbol{U}$ is defined using a randomized Markov policy $\phi$, so that $\boldsymbol{\Phi} = \{\Phi(k) = (X(k),U(k)) : k \ge 0\}$ is a Markov chain. Recall from (9.3b) that its invariant pmf can be expressed $\varpi(x,u) = \pi(x)\phi(u \mid x)$, where $\pi$ is invariant for the Markov chain $\boldsymbol{X}$. It is assumed throughout that $\boldsymbol{X}$ is uni-chain, so that $\pi$ is unique.

Linear-independence of the basis is defined by the rank condition $R^\psi > 0$, with matrix $R^\psi \ge 0$ defined in (9.24). For the tabular setting, this reduces to the diagonal matrix (9.25):

$$ R^\psi(i,i) = \varpi(x^i,u^i), \quad 1 \le i \le d \tag{9.78} $$

It follows that the tabular basis is full rank if and only if the Markov chain $\boldsymbol{\Phi}$ is irreducible in the usual sense, so that $\varpi(x^i,u^i)$ is non-zero for each $i$.


Pre-publication draft -- March 25, 2022 


CHAPTER 9. TEMPORAL DIFFERENCE METHODS 338 
9.6.4 ODE analysis 
Watkins’ algorithm (9.75) admits a simple ODE approximation.


Proposition 9.15. The ODE approximation for the Q-learning algorithm (9.75) takes the form $\frac{d}{dt}\vartheta_t = f^\circ(\vartheta_t)$, with vector field

$$ f_i^\circ(\theta) = \varpi(x^i,u^i)\Bigl[-H^\theta(x^i,u^i) + c(x^i,u^i) + \gamma\sum_{x'} P_{u^i}(x^i,x')\underline{H}^\theta(x')\Bigr] $$

For each $i$, the function $f_i^\circ$ is concave and piecewise linear as a function of $\theta$. □


Concavity of $f_i^\circ$ for each $i$ follows because $\underline{H}^\theta(x')$ is concave in $\theta$ for each $x'$, as it is a minimum of linear functions. This fact is useful when we come to Zap Q-learning.

Prop. 9.15 raises concerns: the pre-multiplication by $\varpi(x^i,u^i)$ in the vector field $f^\circ$ introduces “low gain” for state-input pairs that are rarely visited, and suggests that the “curse of condition number” will be severe if $\varpi$ is far from the uniform pmf (see Section 8.5.1).

The path finding problem provides an example of this curse. The eigenvalues of $A = \partial_\theta f^\circ(\theta^*)$ are real and negative in this example. The right hand side of Fig. 9.1 shows $\{-\lambda_i\}$ on a semi-log scale for two values of $\gamma$. Because $A$ is Hurwitz, when using the step-size $\alpha_n = g/n$, the asymptotic covariance of the resulting algorithm is obtained as a solution to the Lyapunov equation (8.28a), provided each eigenvalue of $gA$ is strictly less than $-1/2$. This translates to the bound $g > 45$ for $\gamma = 0.8$, and $g > 900$ for $\gamma = 0.99$.

These observations inspire the state dependent step-size rule: $\alpha_n^\varpi(x^i,u^i) = 0$ for any $i$ satisfying $\Phi(k) \neq (x^i,u^i)$ for $k \le n$, and otherwise

$$ \alpha_n^\varpi(x,u) = \bigl[\text{number of times } (x,u) \text{ has been visited up to time } n\bigr]^{-1} \tag{9.79} $$
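The rule (9.79) only requires a table of visit counts. The class below is a minimal sketch; the class and method names are hypothetical, and the optional scaling `g` anticipates the scaled rule $g\,\alpha_n^\varpi$ that appears later in the chapter.

```python
from collections import defaultdict


class VisitCountStepSize:
    """Step-size equal to g divided by the number of visits to (x, u) so far."""

    def __init__(self, g: float = 1.0):
        self.g = g
        self.counts = defaultdict(int)  # (x, u) -> number of visits

    def visit(self, x: int, u: int) -> float:
        """Record a visit to (x, u) and return the step-size to use for it."""
        self.counts[(x, u)] += 1
        return self.g / self.counts[(x, u)]

    def step_size(self, x: int, u: int) -> float:
        """Current step-size; zero for a pair that has never been visited."""
        n = self.counts.get((x, u), 0)
        return 0.0 if n == 0 else self.g / n
```

A typical loop would call `visit(x, u)` once per observed pair and use the returned value in the asynchronous update for that entry.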


Proposition 9.16. The following hold for Watkins’ Q-learning with step-size rule (9.79), under the assumption that $\varpi$ is everywhere positive:


(i) It is equivalently expressed as the Q(0)-learning algorithm (9.75) with $\alpha_n = 1/n$, and modified using a diagonal matrix gain:

$$ \theta_{n+1} = \theta_n + \alpha_{n+1}G_nD_{n+1}\zeta_n, \qquad G_n^{-1} = \hat\Pi_n = \frac{1}{n}\sum_{k=1}^n \zeta_k\zeta_k^T \tag{9.80} $$

with the exception $G_n(i,i) = 0$ if $\Phi(k) \neq (x^i,u^i)$ for $k \le n$.

(ii) Its ODE approximation has vector field with components

$$ f_i(\theta) = -H^\theta(x^i,u^i) + c(x^i,u^i) + \gamma\sum_{x'} P_{u^i}(x^i,x')\underline{H}^\theta(x') \tag{9.81} $$

(iii) The parameter recursion in (9.80) admits the representation

$$ \theta_{n+1} = \theta_n + \alpha_{n+1}\bigl\{G_n\zeta_n\zeta_n^T f(\theta_n) + \Delta_{n+1}\bigr\}, $$

in which $\{\Delta_n : n \ge 1\}$ is a martingale difference sequence.
Proof. The representation for $f$ follows immediately from the definitions (especially the special structure for $\zeta_n = \psi_{(n)}$). The irreducibility assumption for $\boldsymbol{\Phi}$ is required here: if this assumption is violated, then there is an index $i$ for which $\varpi(x^i,u^i) = 0$, and hence $\Phi(k) \neq (x^i,u^i)$ for all sufficiently large $k$. This results in $f_i(\theta) = 0$ for any $\theta$.

The sequence $\{\Delta_n\}$ is taken of the form $\Delta_{n+1} = \Delta^\circ_{n+1}G_n\zeta_n$, in which $\{\Delta^\circ_n\}$ is a scalar martingale difference sequence satisfying

$$ \mathsf{E}[\Delta^\circ_{n+1} \mid \mathcal{F}_n] = 0 $$

where $\mathcal{F}_n$ represents the history $\{\Phi(0),\dots,\Phi(n)\}$. The martingale difference property for $\{\Delta_n\}$ follows, since

$$ \mathsf{E}[\Delta_{n+1} \mid \mathcal{F}_n] = \mathsf{E}[\Delta^\circ_{n+1} \mid \mathcal{F}_n]\,G_n\zeta_n $$

Let $i_n$ denote the unique index for which $\zeta_n(i_n) = 1$ (that is, $\Phi(n) = (X(n),U(n)) = (x^i,u^i)$ with $i = i_n$), and denote

$$ \begin{aligned} \Delta^\circ_{n+1} &= -H^n(x^{i_n},u^{i_n}) + c(x^{i_n},u^{i_n}) + \gamma\underline{H}^n(X(n+1)) - f_{i_n}(\theta_n) \\ &= \gamma\Bigl\{\underline{H}^n(X(n+1)) - \sum_{x'} P_{u^{i_n}}(x^{i_n},x')\underline{H}^n(x')\Bigr\} \end{aligned} $$

where the final equation follows from (9.81). The conclusion $\mathsf{E}[\Delta^\circ_{n+1} \mid \mathcal{F}_n] = 0$ follows from the interpretation of the controlled transition matrix as a conditional expectation:

$$ \Delta^\circ_{n+1} = \gamma\underline{H}^n(X(n+1)) - \mathsf{E}[\gamma\underline{H}^n(X(n+1)) \mid \mathcal{F}_n]. $$

□


The notation $H^n = H^{\theta_n}$, and consequently $H^n(x^i,u^i) = \theta_n(i)$, is extended to the ODE approximation, using the notation $q_t$ rather than $\vartheta_t$. The ODE with vector field (9.81) defines the dynamics:

$$ \tfrac{d}{dt}q_t(x,u) = -q_t(x,u) + c(x,u) + \gamma P_u\underline{q}_t(x) \tag{9.82} $$

with $\underline{q}_t(x) = \min_u q_t(x,u)$ and where the matrix notation is used:

$$ P_u\underline{q}_t(x) \mathrel{:=} \sum_{x'} P_u(x,x')\underline{q}_t(x') $$

We expect $H^n(x,u) \approx q_{\tau_n}(x,u)$ for large $n$, and each pair $(x,u)$ (subject to stability and consistent initialization; review Section 8.1.2 if the ODE approximation is not familiar).

The ODE@$\infty$ (8.62) for the vector field (9.81) has a simple form. For any $r > 0$ and $1 \le i \le d$ we have

$$ r^{-1}f_i(r\theta) = -H^\theta(x^i,u^i) + r^{-1}c(x^i,u^i) + \gamma\sum_{x'} P_{u^i}(x^i,x')\underline{H}^\theta(x') $$

Letting $r \uparrow \infty$ gives

$$ [f_\infty(\theta)]_i = -H^\theta(x^i,u^i) + \gamma\sum_{x'} P_{u^i}(x^i,x')\underline{H}^\theta(x') $$

That is, the cost function is removed from the vector field. Stability is easily verified.
Prop. 9.17 can be extended to Q(0)-learning with basis defined by binning (see Exercise 9.3).


Proposition 9.17. For Watkins’ algorithm (9.80),
(i) The function $V(\theta) = \|\tilde\theta\|_\infty$, with $\tilde\theta = \theta - \theta^*$, is a Lyapunov function for the ODE with vector field (9.81):

$$ \tfrac{d^+}{dt}V(\vartheta_t) \le -(1-\gamma)V(\vartheta_t) $$

(ii) The function $V_\infty(\theta) = \|\theta\|_\infty$ is a Lyapunov function for the ODE@$\infty$:

$$ \tfrac{d^+}{dt}V_\infty(\vartheta_t^\infty) \le -(1-\gamma)V_\infty(\vartheta_t^\infty) $$

where the superscript “+” indicates right derivative.


It follows that either of the stability criteria of Prop. 8.9 hold: $V$ is a Lipschitz continuous solution to (QSV1), and the drift condition for $V_\infty$ implies that the origin is asymptotically stable for the ODE@$\infty$.


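Proof of Prop. 9.17. A proof is presented only for the simpler ODE@$\infty$, whose ODE is expressed

$$ \tfrac{d}{dt}q_t^\infty(z) = -q_t^\infty(z) + \gamma P_u\underline{q}_t^\infty(x), \quad z = (x,u) \in \mathsf{X}\times\mathsf{U} $$

The Lyapunov function can be expressed

$$ V_\infty(q_t^\infty) = \max\bigl\{\max_i q_t^\infty(z^i),\ -\min_i q_t^\infty(z^i)\bigr\} $$

where $z^i = (x^i,u^i)$ for each $i$. Based on this we obtain a bound on the right derivative

$$ \tfrac{d^+}{dt}V_\infty(q_t^\infty) \le \max\bigl\{\max_{i \in I_t^+}\tfrac{d}{dt}q_t^\infty(z^i),\ -\min_{i \in I_t^-}\tfrac{d}{dt}q_t^\infty(z^i)\bigr\} $$

where $I_t^+$ is the set of indices satisfying $q_t^\infty(z^i) = \max_j |q_t^\infty(z^j)|$ if and only if $i \in I_t^+$, and $I_t^-$ is the set of indices that are minimizers: $q_t^\infty(z^i) = -\max_j |q_t^\infty(z^j)|$ if and only if $i \in I_t^-$. At a given time $t$, either one of these sets can be empty, but not both.

If $I_t^+$ is non-empty and $i^+ \in I_t^+$ then, with $(x^{i^+},u^{i^+}) = z^{i^+}$,

$$ \tfrac{d}{dt}q_t^\infty(z^{i^+}) = -q_t^\infty(z^{i^+}) + \gamma\sum_{x'} P_{u^{i^+}}(x^{i^+},x')\min_{u'} q_t^\infty(x',u') $$

Applying the definitions gives

$$ \tfrac{d}{dt}q_t^\infty(z^{i^+}) \le -q_t^\infty(z^{i^+}) + \gamma\max_i q_t^\infty(z^i) = -(1-\gamma)V_\infty(q_t^\infty) $$

The same arguments imply the analogous bound for $i^- \in I_t^-$ (when it is non-empty):

$$ \tfrac{d}{dt}q_t^\infty(z^{i^-}) \ge -q_t^\infty(z^{i^-}) + \gamma\min_i q_t^\infty(z^i) = -(1-\gamma)\min_i q_t^\infty(z^i) = (1-\gamma)V_\infty(q_t^\infty) $$

Putting these bounds together establishes the bound $\tfrac{d^+}{dt}V_\infty(q_t^\infty) \le -(1-\gamma)V_\infty(q_t^\infty)$. □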
9.6.5 Variance matters


While stability of Watkins’ algorithm is easily established, it should also be clear that the asymptotic 
covariance of this algorithm is typically infinite. 

To apply the general SA theory we require the Jacobian of $f$ at $\theta^*$. Existence of a derivative requires one more assumption:


Lemma 9.18. Suppose that the optimal policy $\phi^*$ is unique. Then the Jacobian $A = \partial f(\theta^*)$, with $f$ given in (9.81), is given by

$$ A = -I + \gamma T^* \tag{9.83} $$

where $T^*$ defines the transition matrix for $\boldsymbol{\Phi}$ under the optimal policy:

$$ T^*(i,j) = P_{u^i}(x^i,x^j)\,\mathbb{1}\{u^j = \phi^*(x^j)\}, \quad 1 \le i,j \le d $$


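Proof. The uniqueness assumption implies that there exists $\varepsilon > 0$ such that whenever $\theta \in \mathbb{R}^d$ satisfies $\|\theta - \theta^*\| < \varepsilon$,

$$ \begin{aligned} \phi^*(x) &= \arg\min_{u \in \mathsf{U}} H^\theta(x,u), \quad x \in \mathsf{X} \\ f_i(\theta) &= -H^\theta(x^i,u^i) + c(x^i,u^i) + \gamma\sum_{x'} P_{u^i}(x^i,x')H^\theta(x',\phi^*(x')), \quad i \ge 1 \end{aligned} $$

It follows that $f$ is linear in a neighborhood of the optimal parameter and

$$ \begin{aligned} A_{i,j}(\theta) \mathrel{:=} \frac{\partial}{\partial\theta_j}f_i(\theta) &= -\frac{\partial}{\partial\theta_j}H^\theta(x^i,u^i) + \gamma\sum_{x'} P_{u^i}(x^i,x')\frac{\partial}{\partial\theta_j}H^\theta(x',\phi^*(x')) \\ &= -\mathbb{1}\{i = j\} + \gamma P_{u^i}(x^i,x^j)\,\mathbb{1}\{u^j = \phi^*(x^j)\}, \quad \text{whenever } \|\theta - \theta^*\| < \varepsilon \end{aligned} $$

□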
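The structure of the Jacobian in Lemma 9.18 is easy to check numerically on a toy model. The sketch below builds $A = -I + \gamma T^*$ from a controlled transition matrix and a deterministic policy; the indexing convention `P[u][x][y]`, the function name, and the two-state example used to exercise it are assumptions made here, not data from the text. Because each row of $T^*$ sums to one, every row of $A$ sums to $-(1-\gamma)$.

```python
from typing import Dict, List, Tuple


def jacobian_watkins(P: List[List[List[float]]], policy: Dict[int, int],
                     pairs: List[Tuple[int, int]], gamma: float) -> List[List[float]]:
    """Return A = -I + gamma * T*, with T*(i, j) = P[u_i][x_i][x_j] * 1{u_j == policy[x_j]}."""
    d = len(pairs)
    A = [[0.0] * d for _ in range(d)]
    for i, (xi, ui) in enumerate(pairs):
        for j, (xj, uj) in enumerate(pairs):
            t_ij = P[ui][xi][xj] if uj == policy[xj] else 0.0
            A[i][j] = (-1.0 if i == j else 0.0) + gamma * t_ij
    return A
```

Applying `A` to the vector of ones therefore returns $-(1-\gamma)$ times that vector, which is the eigenvalue that drives the gain requirement as $\gamma$ approaches one.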


Lemma 9.18 suggests trouble, since $A$ has an eigenvalue at $-(1-\gamma)$ with eigenvector $v = \mathbf{1}$. It is fortunate that this “worst” eigenvalue is known, so we can design the step-size rule to ensure a finite asymptotic covariance $\Sigma_\theta$ (as defined in (8.6), with $\rho = 1$):


Proposition 9.19. Suppose that the uniqueness assumption of Lemma 9.18 holds, and the step-size rule is modified: $\alpha_n = \min\{\alpha_0, g\alpha_n^\varpi\}$ with $\alpha_0 > 0$, $g > 1/(1-\gamma)$, and $\alpha_n^\varpi$ defined in (9.79). Then $\{\theta_n\}$ is consistent, and the asymptotic covariance of the resulting algorithm is obtained as a solution to the Lyapunov equation (8.28a), in which the noise covariance is diagonal, with entries

$$ \Sigma_\Delta(i,i) = \gamma^2\,\mathsf{E}\bigl[(h^*(X(n+1)) - P_{u^i}h^*(x^i))^2 \mid \Phi(n) = (x^i,u^i)\bigr] \tag{9.84} $$

where $h^*$ is the value function (9.69). □


Many improvements are possible based on theory in previous sections. Relative TD-learning is easily extended to Q-learning and is far more reliable when $\gamma \approx 1$. This is explained next.
9.7 Relative Q-learning


The motivation for relative Q-learning is identical to the fixed policy setting of Section 9.5.3, and the DP equation (9.62) is unchanged except for a change of interpretation:

$$ 0 = \mathsf{E}_\varpi\bigl[-H^*(\Phi(n)) - \delta\langle\mu,H^*\rangle + c_n + \gamma\underline{H}^*(X(n+1)) \mid \mathcal{F}_n\bigr], \quad n \ge 0 \tag{9.85} $$

where $\underline{H}^*(x) = \min_u H^*(x,u)$. The final conclusion of Lemma 9.13 holds in this more complex setting: $H^* = Q^* - k$, with $k$ defined in (9.60):

$$ k = \frac{\delta}{1+\delta-\gamma}\langle\mu,Q^*\rangle $$

We maintain the tabular basis in this section and the next, and restrict to the asynchronous setting, for which $\boldsymbol{\Phi}$ is a single sample path of $(\boldsymbol{X},\boldsymbol{U})$. Following the same steps as in Watkins’ algorithm (9.75), we arrive at


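Relative Q-learning
For initialization $\theta_0, \zeta_0 \in \mathbb{R}^d$, the sequence of estimates is defined recursively:

$$ \begin{aligned} \theta_{n+1} &= \theta_n + \alpha_{n+1}D_{n+1}\psi_{(n)} \\ D_{n+1} &= -H^n(\Phi(n)) - \delta\langle\mu,H^n\rangle + c_n + \gamma\underline{H}^n(X(n+1)) \end{aligned} \tag{9.86} $$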
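The claim that the relative Q-function differs from $Q^*$ by a constant can be checked numerically on a small model. In the sketch below, $Q^*$ is computed by value iteration, the constant is taken to be $k = \delta\langle\mu,Q^*\rangle/(1+\delta-\gamma)$ (the value obtained by substituting $H^* = Q^* - k$ into the relative DP equation with $\mu$ a pmf), and the residual of the relative DP equation is evaluated at $H = Q^* - k$. The model, the indexing convention `P[u][x][y]`, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def q_star(P: List[Matrix], c: Matrix, gamma: float, iters: int = 3000) -> Matrix:
    """Value iteration for Q*(x,u) = c(x,u) + gamma * sum_y P_u(x,y) min_v Q*(y,v)."""
    n_u, n_x = len(P), len(P[0])
    Q = [[0.0] * n_u for _ in range(n_x)]
    for _ in range(iters):
        q_min = [min(row) for row in Q]
        Q = [[c[x][u] + gamma * sum(P[u][x][y] * q_min[y] for y in range(n_x))
              for u in range(n_u)] for x in range(n_x)]
    return Q


def relative_offset(Q: Matrix, mu: Matrix, gamma: float, delta: float) -> float:
    """The constant k = delta * <mu, Q> / (1 + delta - gamma), with mu a pmf."""
    inner = sum(mu[x][u] * Q[x][u] for x in range(len(Q)) for u in range(len(Q[0])))
    return delta * inner / (1.0 + delta - gamma)


def relative_dp_residual(P: List[Matrix], c: Matrix, gamma: float, delta: float,
                         mu: Matrix, H: Matrix) -> float:
    """Largest |-H - delta<mu,H> + c + gamma * P_u H_min| over all pairs (x, u)."""
    n_u, n_x = len(P), len(P[0])
    inner = sum(mu[x][u] * H[x][u] for x in range(n_x) for u in range(n_u))
    h_min = [min(row) for row in H]
    return max(abs(-H[x][u] - delta * inner + c[x][u]
                   + gamma * sum(P[u][x][y] * h_min[y] for y in range(n_x)))
               for x in range(n_x) for u in range(n_u))
```

Subtracting a constant does not change the minimizing inputs, so $H^*$ and $Q^*$ define the same greedy policy; only the scale of the estimates changes.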


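We opt for the step-size sequence (9.79) with scaling, and use the notation $h_t$ rather than $\vartheta_t$, as in (9.82). We obtain from the foregoing $\tfrac{d}{dt}h_t(x^i,u^i) = f_i(h_t)$ with

$$ f_i(h) = -h(x^i,u^i) - \delta\langle\mu,h\rangle + c(x^i,u^i) + \gamma P_{u^i}\underline{h}(x^i) \tag{9.87} $$

in which $\underline{h}(x') = \min_u h(x',u)$, and $P_{u^i}\underline{h}(x^i) = \sum_{x'} P_{u^i}(x^i,x')\underline{h}(x')$.
Prop. 9.17 extends to the relative Q-learning algorithm with a slightly different ODE analysis: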


Proposition 9.20. The ODE@$\infty$ for relative Q-learning is obtained using the vector field (9.87) with cost set to zero:

$$ \tfrac{d}{dt}h_t^\infty(z) = -h_t^\infty(z) + \gamma P_u\underline{h}_t^\infty(x) - \delta\langle\mu,h_t^\infty\rangle, \quad z = (x,u) \in \mathsf{X}\times\mathsf{U} $$

The origin is globally asymptotically stable for any choice of $\gamma \in [0,1)$ and $\delta > 0$.

Proof. We cannot directly apply the Lyapunov function used previously, but instead opt for the span seminorm:

$$ V(\theta) = \|\theta\|_{sp} = \tfrac{1}{2}\bigl\{\max_i\theta_i - \min_i\theta_i\bigr\} = \min_r\max_i|\theta_i - r| $$

Following the same steps as in the proof of Prop. 9.17 we obtain

$$ \tfrac{d^+}{dt}V(h_t^\infty) \le -(1-\gamma)V(h_t^\infty) $$

Separate arguments show that on setting $r_t = \langle\mu,h_t^\infty\rangle$, there is $k > 0$ such that

$$ \tfrac{d}{dt}r_t \le -(1+\delta-\gamma)r_t + kV(h_t^\infty) $$

Given the bound $V(h_t^\infty) \le e^{-(1-\gamma)t}V(h_0^\infty)$, it follows that $r_t \to 0$ exponentially fast as well. □
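The two expressions for the span seminorm used in this proof agree because the inner maximum is minimized by centering at the midpoint of the range. A small sketch, with hypothetical function names:

```python
from typing import List


def span_seminorm(theta: List[float]) -> float:
    """Half the range: (max_i theta_i - min_i theta_i) / 2."""
    return 0.5 * (max(theta) - min(theta))


def shifted_sup_norm(theta: List[float], r: float) -> float:
    """max_i |theta_i - r|; minimizing over r gives the span seminorm."""
    return max(abs(t - r) for t in theta)
```

Adding a constant to every entry leaves `span_seminorm` unchanged, which is why it is a seminorm rather than a norm, and why it is the natural error measure when estimates are only determined up to a constant offset.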
9.7.1 Gain selection


Lemma 9.18 is more easily extended, which is the first step in a variance analysis: 


Lemma 9.21. Suppose that the optimal policy $\phi^*$ is unique. Then the Jacobian $A_r = \partial f(\theta^*)$, with $f$ given in (9.87), is given by

$$ A_r = -I + \gamma T^* - \delta\,\mathbf{1}\otimes\mu \tag{9.88} $$

□


Analysis of the eigenvalues is simplified under the following assumption, which is actually a restatement of the assumption that $\pi$ is unique.

The Markov chain with transition matrix $T^*$ is uni-chain: the eigenspace corresponding to the eigenvalue $\lambda = 1$ is one-dimensional. (9.89)


Under this assumption we denote by $\{\lambda_i\}$ the eigenvalues of $T^*$, ordered so that $\lambda_1 = 1$, and denote

$$ \rho^* = \max\{\mathrm{Re}(\lambda_i) : i \ge 2\}, \qquad \rho^*_+ = \max\{\rho^*, 0\} \tag{9.90a} $$

$$ \rho = \max\{|\lambda_i| : i \ge 2\} \tag{9.90b} $$

The value $1 - \rho$ appeared in Section 6.3, where it was designated the spectral gap of the transition matrix.


Lemma 9.22. The quantities defined in (9.90) satisfy $\rho^*_+ \le \rho$.


[Figure 9.2 panels: eigenvalue plots $\lambda(T^*)$, $\lambda(A_q)$, $\lambda(A_r)$; marked values $-(1+\delta-\gamma)$ and $-(1-\gamma)$]

Figure 9.2: Relationship between the eigenvalues of the matrices $T^*$, $A_q$, and $A_r$.


Prop. 9.23 is a companion to Prop. 9.19, with a big improvement: 


Uniformly Bounded Asymptotic Covariance. The choice of gain $g$ in relative Q-learning can be fixed, independent of $\gamma < 1$, subject to a lower bound on $\delta$, resulting in asymptotic covariance that is uniformly bounded for $0 \le \gamma < 1$. One choice is $\delta = \gamma$ and $g = 1/(1-\rho^*_+)$.


A positive spectral gap is not necessary for stability of relative Q-learning, or uniform boundedness of the asymptotic covariance. This observation is illustrated in Fig. 9.2, comparing the spectra of $T^*$, $A_r$, and $A_q$ (the Jacobian obtained with $\delta = 0$). The plot of eigenvalues for $T^*$ indicates complex eigenvalues on the unit circle, so that $\rho = 1$: the spectral gap is zero. The plot on the left hand side shows that $\rho^* = \rho^*_+ < 1$.
Proposition 9.23. For the relative Q-learning algorithm (9.86) with step-size rule (9.79), the matrix $A_r$ in (9.88) is equal to the Jacobian $A_r = \partial_\theta f(\theta)|_{\theta=\theta^*}$. If $\delta \ge \gamma(1-\rho^*_+)$, then each eigenvalue of $A_r$ satisfies $\mathrm{Re}(\lambda(A_r)) \le -(1-\gamma\rho^*_+)$. Consequently, if the step-size is scaled via $\alpha_n(x,u) = g\alpha_n^\varpi(x,u)$ with $g \ge 1/(1-\gamma\rho^*_+)$, then each eigenvalue of $gA_r$ satisfies

$$ \mathrm{Re}\bigl(\lambda(gA_r)\bigr) = -g\,\mathrm{Re}\bigl(\lambda(I - [\gamma T^* - \delta\,\mathbf{1}\otimes\mu])\bigr) \le -1 $$

The asymptotic covariance of the resulting algorithm is obtained as a solution to the Lyapunov equation (8.28a). □


9.7.2 Honest conclusions 


Prop. 9.23 suggests that we should abandon Watkins’ algorithm for its relative counterpart. In fact, a more detailed analysis reveals that we may view this proposition as a guide to gain selection for Q-learning in its standard form; that is, the recursion (9.86) with $\delta = 0$.

With a bit more work we can show that while the asymptotic covariance of Watkins’ algorithm is typically infinite if $g < \tfrac{1}{2}(1-\gamma)^{-1}$, it is only infinite on a one dimensional subspace spanned by $\mathbf{1}$, provided $g > 0$ satisfies the assumption of Prop. 9.23.

Let $Q^n$ denote the estimate of $Q^*$ at iteration $n$, obtained using either Q-learning or relative Q-learning. The span-semi-norm of the error is denoted

$$ \|Q^n - Q^*\|_{sp} = \min_r\max_{x,u}|Q^n(x,u) - Q^*(x,u) - r| $$

Proofs of the following claims can be found in [114]:


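Proposition 9.24. Fix $g \ge 1/(1-\rho^*_+)$, and let $\Sigma_\theta \ge 0$ be the solution to the Lyapunov equation (8.28a) using this $g$, $A = -I + \gamma[T^* - \mathbf{1}\otimes\mu]$, and $\Sigma_\Delta$ the diagonal matrix (9.84).
The following then hold for any $\delta \ge 0$ (not excluding $\delta = 0$), with step-size sequence $g\alpha_n^\varpi$:


(i) For any vector $v \in \mathbb{R}^d$ satisfying $v^T\mathbf{1} = 0$,

$$ \lim_{n\to\infty} n\,\mathsf{E}[(v^T\tilde\theta_n)^2] = \lim_{n\to\infty} n\,v^T\mathsf{E}[\tilde\theta_n\tilde\theta_n^T]v = v^T\Sigma_\theta v < \infty \tag{9.91} $$

(ii) The scaled mean-square span seminorm vanishes at rate $1/n$:

$$ \sup_n\, n\,\mathsf{E}\bigl[\|Q^n - Q^*\|_{sp}^2\bigr] < \infty $$

□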


These conclusions are illustrated using the example of Fig. 9.1. Fig. 9.3 compares three algorithms, with two different discount factors, $\gamma = 1 - 10^{-3}$ and $\gamma = 1 - 10^{-4}$. The three algorithms are distinguished by step-size (the value of $g$ in $\alpha_n = g\alpha_n^\varpi$) and the value of $\delta$:


1. Relative Q-learning with gain $g_r = 1/(1-\rho^*_+\gamma)$ and $\delta = 1$.
2. Watkins’ Q-learning ($\delta = 0$) with gain $g_r$.
3. Watkins’ Q-learning with $g_q = 1/(1-\gamma)$.


Consider first relative Q-learning: Fig. 1.3 shows histograms of $\theta_n(i)$ from this algorithm for $i = 10$ and $\gamma = 1 - 10^{-3}$ (based on $10^3$ independent runs); the value of $i$ chosen is not important, since similar results are observed for each component. The CLT approximation is nearly perfect for $n \ge 10^4$. The curse of condition number is solved in this example: the condition number $\kappa(A_r)$ is less than 30 for all $\gamma < 1$, while $\kappa(A_q)$ is of order $1/(1-\gamma)$, tending to infinity as $\gamma \uparrow 1$ (recall (8.55) for the definition of $\kappa$).

[Figure 9.3 panels: (a) Typical Sample Paths, $1/(1-\gamma) = 10^3$; (b) Average Behavior, $1/(1-\gamma) = 10^4$. Vertical axis: span norm $\|Q^n - Q^*\|_{sp}$; horizontal axis: $n$. Legend: Q-learning with $g = 1/(1-\rho^*_+\gamma)$ and with $g = 1/(1-\gamma)$; relative Q-learning with $g = 1/(1-\rho^*_+\gamma)$.]

Figure 9.3: (a) Span norm error for Q-learning and relative Q-learning is similar for $\gamma \approx 1$. (b) Average error for the three algorithms, with $1/(1-\gamma) = 10^4$.

It is also found that Watkins’ algorithm works great with gain $g_r$, when performance is measured in the span-semi-norm: Fig. 9.3 (a) illustrates the behavior of each of the three algorithms on a single sample path. Fig. 9.3 (b) shows the average error, obtained by averaging $10^3$ independent runs of each algorithm. The evolution of $\|Q^n - Q^*\|_{sp}$ is nearly identical using either Watkins’ algorithm or relative Q-learning, with common step-size gain $g_r = 1/(1-\rho^*_+\gamma)$.

Fig. 6.5 shows results for Watkins’ algorithm with the standard step-size $\alpha_n = \alpha_n^\varpi$ ($g = 1$), and $\gamma = 0.8$ for which the asymptotic variance is infinite. The infinite variance is not apparent because of the poor design of the experiment, using $\theta_0(i) = 0$ for each $i$. Remember the advice given below the figure: it is known that the limit is positive, so it would make sense to sample the initial parameter uniformly on a widely spaced interval of the form $[0, r]$.


9.8 GQ and Zap 


This section contains more ideas to accelerate Watkins’ Q-learning, and also inspire new techniques for use outside of the tabular setting. However, theory in this section is restricted to the tabular basis.

An understanding of the matrix gain algorithms described here requires a closer look at the vector field $f^\circ$ given in Prop. 9.15.

When the state space and input space are finite as assumed here, there are only a finite number of deterministic stationary policies. Denote these by $\{\phi^m : 1 \le m \le M\}$ where $M$ is the number of possible functions from $\mathsf{X}$ to $\mathsf{U}$. For each $m$ denote

$$ \Theta^m = \bigl\{\theta \in \mathbb{R}^d : \phi^m(x) \in \arg\min_u H^\theta(x,u) \text{ for each } x \in \mathsf{X}\bigr\} $$

Each set $\Theta^m$ is a convex cone: $r_1\theta^1 + r_2\theta^2 \in \Theta^m$ whenever $\theta^i \in \Theta^m$ and $r_i \in \mathbb{R}_+$, for $i = 1,2$. Fig. 9.4 provides an illustration of these sets.

Figure 9.4: Parameter space decomposition for Watkins’ Q-learning. The path $\{q_t\}$ is one trajectory of the Newton-Raphson flow associated with $f^\circ$.

Let $Q^m$ denote the fixed-policy Q-function obtained with policy $\phi^m$. On denoting by $T_m$ the transition matrix for the resulting Markov chain $\boldsymbol{\Phi}$ with control $\phi^m$, we have

$$ Q^m = [I - \gamma T_m]^{-1}c $$

These Q-functions are also indicated in Fig. 9.4, which is justified by the identity (9.73). The index $m^*$ shown in the figure is special because $Q^{m^*} \in \Theta^{m^*}$, which implies the fixed-point equation:

$$ \phi^{m^*}(x) \in \arg\min_u Q^{m^*}(x,u) \quad \text{for each } x \in \mathsf{X} $$

That is, this step of policy iteration returns the same policy, which implies that $Q^{m^*} = Q^*$.
Let $\Pi$ denote the diagonal matrix $\Pi = \mathrm{diag}(\varpi)$ (also equal to $R^\psi$ as defined in (9.78)).


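Lemma 9.25. The Jacobian $A = \partial f^\circ$ is piecewise constant, with $A(\theta)$ independent of $\theta$ within the interior of $\Theta^m$ for each $m$. If $\theta \in \mathbb{R}^d$ satisfies $\theta \in \mathrm{interior}(\Theta^m)$ for any $m$, then at this value the Jacobian is given by

$$ A(\theta) = -\Pi[I - \gamma T_m], \qquad T_m(i,j) \mathrel{:=} P_{u^i}(x^i,x^j)\,\mathbb{1}\{u^j = \phi^m(x^j)\}, \quad 1 \le i,j \le d \tag{9.92a} $$

The function $f^\circ$ is thus continuous and piecewise linear, and for each $\theta$

$$ f^\circ(\theta) = A(\theta)\theta - \Pi b, \quad \text{where } b_i = -c(x^i,u^i) \text{ for each } i. \tag{9.92b} $$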
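The piecewise linear representation can be verified numerically: evaluate the vector field of Prop. 9.15 directly from its definition, and again as $A(\theta)\theta - \Pi b$ using the greedy policy at $\theta$ to select the region. The sketch below is illustrative only; the indexing convention `P[u][x][y]`, the list-based storage of $\theta$, $c$ and $\varpi$ in the order of `pairs`, and the function names are assumptions made here.

```python
from typing import Dict, List, Tuple

Pairs = List[Tuple[int, int]]


def greedy(theta: List[float], pairs: Pairs, states: List[int]) -> Dict[int, int]:
    """Greedy policy for the tabular parameter: phi(x) minimizes theta over pairs (x, u)."""
    policy = {}
    for x in states:
        candidates = [(theta[i], u) for i, (xi, u) in enumerate(pairs) if xi == x]
        policy[x] = min(candidates)[1]
    return policy


def f_circ_direct(theta, P, c, varpi, pairs: Pairs, states, gamma) -> List[float]:
    """Vector field computed from its definition, using the min over inputs."""
    h_min = {x: min(theta[i] for i, (xi, _) in enumerate(pairs) if xi == x) for x in states}
    return [varpi[i] * (-theta[i] + c[i] + gamma * sum(P[u][x][y] * h_min[y] for y in states))
            for i, (x, u) in enumerate(pairs)]


def f_circ_affine(theta, P, c, varpi, pairs: Pairs, states, gamma) -> List[float]:
    """Same vector field as A(theta) theta - Pi b, with b = -c and A = -Pi [I - gamma T_m]."""
    policy = greedy(theta, pairs, states)
    out = []
    for i, (xi, ui) in enumerate(pairs):
        row = [varpi[i] * (-(1.0 if i == j else 0.0)
                           + gamma * (P[ui][xi][xj] if uj == policy[xj] else 0.0))
               for j, (xj, uj) in enumerate(pairs)]
        a_theta = sum(a * t for a, t in zip(row, theta))
        out.append(a_theta - varpi[i] * (-c[i]))
    return out
```

The two functions agree for every $\theta$, including points on the boundary between regions, which reflects the continuity of $f^\circ$ even though $A(\theta)$ itself jumps there.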


9.8.1 GQ-Learning 


This algorithm is defined exactly as in the deterministic setting of Section 5.4.4, with the same objective: solve

$$ \min_\theta\Gamma(\theta) = \min_\theta\tfrac{1}{2}\{f^\circ(\theta)\}^TMf^\circ(\theta) $$

where once again $M^{-1} = \mathsf{E}_\varpi[\zeta_n\zeta_n^T]$ with $\zeta_n = \psi_{(n)}$ (recall (9.4) for notation). For the tabular basis this reduces to $M^{-1} = \Pi$, and $\Gamma(\theta^*) = 0$.

The ODE method based on gradient descent will provide a recursive algorithm, as in the deterministic setting considered previously (recall (5.55)). First we must interpret the ODE. The following representations are implied by (9.92b).


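Lemma 9.26. If $\theta \in \mathbb{R}^d$ satisfies $\theta \in \mathrm{interior}(\Theta^m)$ for any $m$, then $\Gamma$ is quadratic in a neighborhood of $\theta$, with partial derivatives

$$ \begin{aligned} \nabla\Gamma(\theta) &= A(\theta)^TMf^\circ(\theta) = -[I - \gamma T_m]^Tf^\circ(\theta) \\ \nabla^2\Gamma(\theta) &= [I - \gamma T_m]^T\,\Pi\,[I - \gamma T_m] \end{aligned} $$


GQ-learning is the two-time scale SA algorithm, designed to approximate $\tfrac{d}{dt}\vartheta = -\nabla\Gamma(\vartheta)$, with $-\nabla\Gamma(\theta) = f^\circ(\theta) - \gamma T_m^Tf^\circ(\theta)$. A challenge is to interpret the product:

$$ [T_m^Tf^\circ(\theta)]_i = \sum_j f_j^\circ(\theta)T_m(j,i), \quad \theta \in \mathrm{interior}(\Theta^m) $$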
For the tabular basis this can be expressed

$$ [T_m^Tf^\circ(\theta)]_i = \sum_{j,k} f_j^\circ(\theta)T_m(j,k)\psi_i(x^k,u^k) = \sum_j f_j^\circ(\theta)\,\mathsf{E}\bigl[\underline{\psi}_{(n+1)} \mid \Phi(n) = (x^j,u^j)\bigr]_i, \tag{9.93} $$

where $\underline{\psi}_{(n+1)} = \psi(X(n+1),\phi^m(X(n+1)))$ for $\theta \in \mathrm{interior}(\Theta^m)$.
This representation lends itself to algorithm design. Recall from (9.74) that $\phi_n$ denotes the greedy policy associated with $H^n$.


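GQ-learning
For initialization $\theta_0, \omega_0 \in \mathbb{R}^d$,

$$ \theta_{n+1} = \theta_n + \alpha_{n+1}\bigl\{D_{n+1}\psi_{(n)} - \gamma\,\omega_n^T\psi_{(n)}\,\underline{\psi}_{(n+1)}\bigr\} \tag{9.94a} $$

$$ \omega_{n+1} = \omega_n + \beta_{n+1}\psi_{(n)}\bigl\{D_{n+1} - \psi_{(n)}^T\omega_n\bigr\} \tag{9.94b} $$

where $\underline{\psi}_{(n+1)} = \psi(X(n+1),\phi_n(X(n+1)))$ and $D_{n+1} = -H^n(\Phi(n)) + c_n + \gamma\underline{H}^n(X(n+1))$,

and where the two step-size sequences satisfy (8.22).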
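For the tabular basis both recursions reduce to scalar updates on at most two entries of $\theta$ and one entry of $\omega$. The sketch below assumes the reading of (9.94) in which the correction term is $\gamma\,(\psi_{(n)}^T\omega_n)\,\underline{\psi}_{(n+1)}$; the function name and the use of integer indices `i` (for the current pair) and `j` (for the next state paired with its greedy input) are conventions adopted here.

```python
from typing import List, Tuple


def gq_tabular_step(theta: List[float], omega: List[float], i: int, j: int,
                    cost: float, alpha: float, beta: float,
                    gamma: float) -> Tuple[List[float], List[float]]:
    """One GQ-learning step for the tabular basis.

    i indexes Phi(n); j indexes (X(n+1), phi_n(X(n+1))), the greedy pair at the next state.
    """
    # With a tabular basis, the minimum over inputs at X(n+1) is the entry theta[j].
    D = -theta[i] + cost + gamma * theta[j]
    theta_new = list(theta)
    theta_new[i] += alpha * D                  # alpha * D * psi_(n)
    theta_new[j] -= alpha * gamma * omega[i]   # correction along the next greedy pair
    omega_new = list(omega)
    omega_new[i] += beta * (D - omega[i])      # fast time-scale update of omega
    return theta_new, omega_new
```

Both updates use the old values $\theta_n$ and $\omega_n$ on the right hand side, so the order of the two assignments does not matter.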


The analysis that follows concludes with Prop. 9.27, which implies that the condition number for the linearized ODE dynamics can be expected to be of order $O(1/(1-\gamma)^2)$. Consequently, GQ learning is probably not the best option in a tabular setting. It is presented here because it may be useful in function approximation settings for which it is not known if $f^\circ(\theta) = 0$ has a solution.


GQ analysis The fast time scale recursion (9.94b) is designed so that $\omega_n \approx Mf^\circ(\theta_n)$ for large $n$. Theory for two time-scale SA provides an approximation of (9.94a):

$$ \theta_{n+1} \approx \theta_n + \alpha_{n+1}\bigl\{D_{n+1}\zeta_n - \gamma f^\circ(\theta_n)^TM\psi_{(n)}\,\underline{\psi}_{(n+1)}\bigr\} $$

We conclude that the ODE approximation of this recursion is gradient descent on applying Lemma 9.26 and (9.93).
The curse of condition number is potentially worse when using this algorithm:


Proposition 9.27. Suppose that the optimal policy $\phi^*$ is unique. Then, the linearization matrix for GQ-learning is given by

$$ A_{GQ} = -\nabla^2\Gamma(\theta^*) = -[I - \gamma T^*]^T\,\Pi\,[I - \gamma T^*] $$

Consequently, the eigenvalues of the matrix $A_{GQ}$ are real and non-positive, and its condition number admits the lower bound

$$ \kappa(A_{GQ}) \ge \frac{d}{(1-\gamma)^2}\max\bigl\{(1-\lambda\gamma)^2\,v^T\Pi v\bigr\} $$

where the max is over all eigenvalue-eigenvector pairs $(\lambda,v)$ for $T^*$ satisfying $\|v\| = 1$.
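Proof. Because $A_{GQ} = -\nabla^2\Gamma(\theta^*)$ is symmetric, its condition number is the ratio of eigenvalues:

$$ \kappa(A_{GQ}) = \frac{\lambda_{\max}(\nabla^2\Gamma(\theta^*))}{\lambda_{\min}(\nabla^2\Gamma(\theta^*))} $$

It remains to establish the following bounds:

$$ \lambda_{\min}(\nabla^2\Gamma(\theta^*)) \le (1-\gamma)^2/d \tag{9.95a} $$

$$ \lambda_{\max}(\nabla^2\Gamma(\theta^*)) \ge (1-\lambda\gamma)^2\,v^T\Pi v, \quad \text{for any eigenvalue-eigenvector pair } (\lambda,v) \text{ for } T^*, \tag{9.95b} $$

where in (9.95b) the eigenvectors are normalized, with $\|v\| = 1$.

The proof is based on the pair of identities

$$ \begin{aligned} \lambda_{\min}(\nabla^2\Gamma(\theta^*)) &= \min_v v^T\nabla^2\Gamma(\theta^*)v \\ \lambda_{\max}(\nabla^2\Gamma(\theta^*)) &= \max_v v^T\nabla^2\Gamma(\theta^*)v \end{aligned} $$

where the max and min are over all $v \in \mathbb{R}^d$ satisfying $\|v\| = 1$. To obtain bounds we restrict to eigenvectors of $T^*$, normalized by $\|v\| = 1$, from which we obtain

$$ v^T\nabla^2\Gamma(\theta^*)v = v^T[I - \gamma T^*]^T\,\Pi\,[I - \gamma T^*]v = (1-\lambda\gamma)^2\,v^T\Pi v $$

and hence the pair of bounds

$$ \lambda_{\min}(\nabla^2\Gamma(\theta^*)) \le (1-\lambda\gamma)^2\,v^T\Pi v \le \lambda_{\max}(\nabla^2\Gamma(\theta^*)) $$

The lower bound (9.95b) follows.
To obtain the bound (9.95a) on the minimal eigenvalue, take $v = \mathbf{1}/\sqrt{d}$, which is an eigenvector of $T^*$ with eigenvalue $\lambda = 1$:

$$ \lambda_{\min}(\nabla^2\Gamma(\theta^*)) \le (1-\lambda\gamma)^2\,v^T\Pi v = (1-\gamma)^2/d $$

□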


9.8.2 Zap Q-Learning 


To crush the condition number curse, there is no better approach than Zap SA. This is easily 
adapted to Q-learning, even in non-linear function approximation settings. For analysis we continue 
to restrict to the tabular setting. 

The quasi-linear SA representation of Watkins’ algorithm in (9.76) is most easily adapted to 
create an algorithm. 


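Zap Q-learning
For initialization $\theta_0 \in \mathbb{R}^d$ and $\hat A_0 \in \mathbb{R}^{d\times d}$, with $\{A_n, b_n\}$ defined in (9.76),

$$ \begin{aligned} \hat A_{n+1} &= \hat A_n + \beta_{n+1}\{-\hat A_n + A_{n+1}\} \\ \theta_{n+1} &= \theta_n + \alpha_{n+1}G_{n+1}\bigl[A_{n+1}\theta_n - b_{n+1}\bigr], \qquad G_{n+1} = -\hat A_{n+1}^{-1} \end{aligned} \tag{9.96} $$

where the two step-size sequences satisfy (8.22).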




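This algorithm is intended to approximate the Newton-Raphson flow:

$$ \tfrac{d}{dt}\vartheta_t = G(\vartheta_t)f^\circ(\vartheta_t) $$

with $G(\theta) = -[\partial_\theta f^\circ(\theta)]^{-1}$. Based on the representation for $f^\circ$ in (9.92) along with Lemma 9.25 which gives $A(\theta)$, and using the suggestive notation $q$ in place of $\vartheta$, this becomes

$$ \tfrac{d}{dt}q_t = -q_t + A(q_t)^{-1}\Pi b = -q_t + Q^m, \quad q_t \in \mathrm{interior}(\Theta^m) $$

where the second equality uses $A(q_t)^{-1}\Pi b = -[I - \gamma T_m]^{-1}b = [I - \gamma T_m]^{-1}c$. On denoting $\tilde q_t^m = q_t - Q^m$ we obtain linear error dynamics in the region $\Theta^m$:

$$ \tfrac{d}{dt}\tilde q_t^m = -\tilde q_t^m \tag{9.97} $$

These dynamics are illustrated in Fig. 9.4 with $m = 3$.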


[Figure 9.5 panels: histograms of the normalized error $W_n$ for increasing values of $n$, in rows labeled $\gamma = 0.80$, $\gamma = 0.80$, $\gamma = 0.99$. Legend: experimental histogram, experimental density, theoretical density.]

Figure 9.5: Comparison of theoretical and empirical asymptotic variance for Q-learning applied to the 6-state example. Row 1: Watkins’ algorithm with gain $g = 70$, $\gamma = 0.8$. Row 2: Zap Q-learning with $\gamma = 0.8$. Row 3: Zap Q-learning with $\gamma = 0.99$.


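Testing the CLT Zap Q-learning has minimal asymptotic covariance, given by (8.30):

$$ \Sigma_\theta^* \mathrel{:=} A^{-1}\Sigma_\Delta(A^{-1})^T $$

For illustration, consider the six state example shown in Fig. 9.1, for which it is possible to compute both $A$ and $\Sigma_\Delta$ [110]. With $N = 10^3$ independent runs, histograms of the normalized error are obtained:

$$ W_n^i = \sqrt{n}\,[\theta_n^i - \bar\theta_n], \qquad \bar\theta_n = \frac{1}{N}\sum_{i=1}^N\theta_n^i, \quad 1 \le i \le N \tag{9.98} $$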
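The normalized errors used for these histograms can be computed from the final estimates of the independent runs. The sketch below treats a single scalar component of the parameter; the function name and the list representation are hypothetical.

```python
import math
from typing import List


def normalized_errors(final_estimates: List[float], n: int) -> List[float]:
    """Return sqrt(n) * (theta_n^i - mean over runs) for each independent run i."""
    num_runs = len(final_estimates)
    mean = sum(final_estimates) / num_runs
    return [math.sqrt(n) * (t - mean) for t in final_estimates]
```

By construction the returned values sum to zero, so a histogram built from them is centered, and its empirical variance can be compared with the corresponding diagonal entry of the theoretical asymptotic covariance.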


Fig. 1.1 shows a single sample path of Zap Q-learning, compared with several other algorithms using the step-size $\alpha_n = 1/n$ (a poor choice, since in this plot the discount factor was taken to be $\gamma = 0.99$).
Fig. 9.5 shows histograms obtained using Zap Q-learning, and also Watkins’ algorithm using step-size $\alpha_n = g/n$ with $g = 70$ (resulting in a finite asymptotic covariance for the discount factor $\gamma = 0.8$ chosen).

The first row shows the histogram for Watkins’ algorithm, and the second and third rows for Zap Q-learning, with $\beta_n = (\alpha_n)^{0.85} = 1/n^{0.85}$. Data in the third row was obtained with the discount factor $\gamma = 0.99$. The covariance estimates and the Gaussian approximations match the theoretical predictions well for $n \ge 10^4$.

Analysis of this algorithm falls way outside of the scope of this book, because $A(\theta)$ is not continuous, so there is no “off the shelf” theory to justify an ODE approximation. Exploiting concavity of the entries of $f^\circ$, it is possible to justify an ODE approximation, and this can be extended to general nonlinear function approximation (with substantial effort) [90].


Zap Zero Zap Q-learning is blindingly fast in practice, but the update equation for $\theta_{n+1}$ in (9.96) is complex. We have a solution to this complexity using the first order Zap SA algorithm (8.52).


[Figure 9.6: vertical axis: span norm $\|Q^n - Q^*\|_{sp}$; horizontal axis: $n$. Legend: Zap, Watkins with $g = 1/(1-\gamma)$, ZapZero, PJR averaging.]

Figure 9.6: Span norm error for Zap Zero Q-learning


The algorithm is applicable because one crucial assumption of Prop. 8.7 holds: $A(\theta)$ is Hurwitz for each $\theta$. With $\{A_n, b_n\}$ defined in (9.76), the algorithm (8.52) is expressed as follows:


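Zap Zero Q-learning


Initialize $\theta_0, \omega_0 \in \mathbb{R}^d$. Update for $n \ge 0$:

$$ \theta_{n+1} = \theta_n + \alpha_{n+1}\omega_n \tag{9.99a} $$

$$ \omega_{n+1} = \omega_n + \beta_{n+1}\bigl\{A_{n+1}\omega_n + (A_{n+1}\theta_n - b_{n+1})\bigr\} \tag{9.99b} $$

where the two step-size sequences satisfy (8.22).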
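With the tabular basis, $A_{n+1}$ has a single non-zero row, so one step of Zap Zero costs little more than one step of Watkins' algorithm. The sketch below is an illustration under two assumptions: the signs in (9.99b) are taken to be those for which the fast recursion settles at $\omega \approx -A^{-1}\{A\theta - \bar b\}$, and the function name and index conventions (`i` for the current pair, `j` for the next state paired with its greedy input) are hypothetical.

```python
from typing import List, Tuple


def zap_zero_tabular_step(theta: List[float], omega: List[float], i: int, j: int,
                          cost: float, alpha: float, beta: float,
                          gamma: float) -> Tuple[List[float], List[float]]:
    """One Zap Zero step for the tabular basis, using A = e_i (-e_i + gamma e_j)^T, b = -cost e_i."""
    # Entry i of A_{n+1} theta_n - b_{n+1}; this is the temporal difference D_{n+1}.
    D = -theta[i] + cost + gamma * theta[j]
    # Entry i of A_{n+1} omega_n; all other entries of both vectors are zero.
    a_omega = -omega[i] + gamma * omega[j]
    theta_new = [t + alpha * w for t, w in zip(theta, omega)]  # slow update, uses old omega
    omega_new = list(omega)
    omega_new[i] += beta * (a_omega + D)                       # fast update
    return theta_new, omega_new
```

Every entry of $\theta$ moves at each step because $\omega_n$ is a full vector, whereas only the visited entry of $\omega$ changes.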


The fast time-scale recursion (9.99b) is designed to obtain the approximation

$$ \omega_n \approx -A(\theta_n)^{-1}\{A(\theta_n)\theta_n - \bar b\} \tag{9.100} $$

in which $\bar b$ denotes the steady-state mean of $b_{n+1}$.

The Zap Zero algorithm is not much more complex than Watkins’ original algorithm. We have doubled the number of parameter elements since we must update $\theta_n$ and $\omega_n$ at each stage, but the updates are not at all complex.

When this and Zap Q-learning are compared for the example considered in Fig. 9.3 we observe that the span seminorm errors for the respective parameter estimates $\{\theta_n\}$ are very similar. Results are shown in Fig. 9.6, alongside results using PJR averaging and Watkins’ algorithm with gain $\alpha_n = g\alpha_n^\varpi$, and $g = 1/(1-\gamma)$. The transient behavior of Zap Zero Q-learning is better than what is obtained using averaging, and a bit worse than Zap Q-learning.
For Zap Q-learning, any value $0.6 \le \rho \le 0.9$ for $\beta_n = n^{-\rho}$ in (9.96) gives similar performance. However, for both Zap Zero and PJR averaging, it was found in this example that a very large step-size is required for the fast time-scale. These results were obtained using $\beta_n = n^{-\rho}$ with $\rho = 0.1$. Performance was terrible for Zap Zero and PJR averaging with $\rho \ge 0.5$. New theory is needed to explain these findings.


9.9 Technical Proofs* 


9.9.1 Advantage function 


Proof of Prop. 9.1. The conclusion $G^*(x) = \mathsf{E}[Q(\Phi(n)) \mid X(n) = x] = \sum_u \phi(u \mid x)Q(x,u) = \underline{Q}(x)$ is obtained upon recalling the definition of the conditional expectation in (9.15). Recall that $\underline{Q} = h$ is the value function (9.6a). □


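Proof of Prop. 9.2. Eq. (9.11) can be extended to show that for each $x \in \mathsf{X}$ and $k \ge 0$,

$$ \mathsf{E}^{\phi'}\bigl[V_\phi(\Phi(k)) \mid X(0) = x\bigr] = \mathsf{E}^{\phi'}\bigl[c(\Phi(k)) + \gamma h_\phi(X(k+1)) - h_\phi(X(k)) \mid X(0) = x\bigr] $$

Consequently,

$$ \begin{aligned} \mathsf{E}^{\phi'}\Bigl[\sum_{k=0}^\infty \gamma^kV_\phi(\Phi(k)) \mid X(0) = x\Bigr] &= \mathsf{E}^{\phi'}\Bigl[\sum_{k=0}^\infty \gamma^k\{c(\Phi(k)) + \gamma h_\phi(X(k+1)) - h_\phi(X(k))\} \mid X(0) = x\Bigr] \\ &= \mathsf{E}^{\phi'}\Bigl[-h_\phi(X(0)) + \sum_{k=0}^\infty \gamma^kc(\Phi(k)) \mid X(0) = x\Bigr] \end{aligned} $$

□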


Proof of Prop. 9.14. The representation (9.65a) is the definition of $V$, and (9.65b) follows from (9.65a) since (in the compact notation) $Q(x,u) = c(x,u) + \gamma P_uh(x)$. The final representation (9.65c) follows from (9.65b) and the dynamic programming equation $h = c_\phi + \gamma P_\phi h$. □


9.9.2. TD stability theory 


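Bounds on the eigenvalues of $A$ for on-policy TD($\lambda$) are established by careful consideration of the
autocorrelation sequence (9.23). This can be expressed

$$R(i) = \Sigma(i) + \bar\psi\bar\psi^\intercal, \qquad i \in \mathbb{Z}, \tag{9.101}$$

where $\bar\psi = \mathsf{E}_\pi[\psi(X(n))]$, and letting $\tilde\psi = \psi - \bar\psi$,

$$\Sigma(i) = \mathsf{E}_\pi[\tilde\psi(X(n+i))\,\tilde\psi(X(n))^\intercal]$$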

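Eigenvalue bounds are obtained in terms of the scalars

$$\varrho = \max_{i \ge 1}\,\max_{y}\, \frac{|y^\dagger R(i) y|}{y^\dagger R^\psi y}, \qquad \varrho_\lambda = \frac{\gamma(1-\lambda)}{1-\gamma\lambda}\,\varrho \tag{9.102}$$

where $y^\dagger$ denotes complex conjugate transpose, $R^\psi = R(0)$, and the maximum excludes $y = 0$.

Lemma 9.28. We have $\varrho \le 1$ and $\varrho_\lambda \le \gamma\varrho \le \gamma$. If $\Sigma^\psi > 0$ and $\boldsymbol{\Phi}$ is aperiodic, then $\varrho < 1$.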


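Proof. It is obvious that $\varrho_\lambda \le \gamma\varrho$ whenever $0 \le \lambda \le 1$. We proceed to bound $\varrho$.
We have from the Cauchy-Schwarz inequality, for any non-zero $y \in \mathbb{C}^d$,

$$
\begin{aligned}
|y^\dagger R(i) y| &= \bigl|\mathsf{E}_\pi[y^\dagger \psi(X(n+i))\,\psi(X(n))^\intercal y]\bigr| \\
&\le \sqrt{\mathsf{E}_\pi[|y^\dagger\psi(X(n+i))|^2]}\;\sqrt{\mathsf{E}_\pi[|\psi(X(n))^\intercal y|^2]}
\end{aligned}
$$

The right hand side is precisely $\mathsf{E}_\pi[|\psi(X(n))^\intercal y|^2] = y^\dagger R^\psi y$, from which we conclude that $\varrho \le 1$.
To complete the proof we obtain a strict inequality under aperiodicity and the full rank condition
$\Sigma^\psi > 0$. It is sufficient to restrict to $i = 1$, and show that $|y^\dagger R(1) y| < y^\dagger R^\psi y$ for any non-zero
$y \in \mathbb{C}^d$ when $\Sigma^\psi > 0$ and $\boldsymbol{\Phi}$ is aperiodic.
The Cauchy-Schwarz inequality tells us more: if equality holds, then there is a complex number
$w$ satisfying $|w| = 1$ and $y^\dagger\psi(X(n+1)) = w\, y^\dagger\psi(X(n))$ with probability one. Since $n$ is arbitrary
we can iterate to obtain with probability one,

$$y^\dagger\psi(X(n+i)) = w^i\, y^\dagger\psi(X(n)), \qquad i \ge 1$$

Multiplying each side by $\psi(X(n))^\intercal y$ on the right and taking expectations gives

$$y^\dagger R(i) y = w^i\, y^\dagger R^\psi y$$

Aperiodicity tells us that $\lim_{i\to\infty} R(i) = \lim_{i\to\infty} \mathsf{E}[\psi(X(i))\psi(X(0))^\intercal] = \bar\psi\bar\psi^\intercal$. If $y^\dagger\bar\psi = 0$ it follows that
$y^\dagger R^\psi y = 0$, which contradicts the assumption $R^\psi > 0$.
Otherwise, we take expectations to obtain $y^\dagger\bar\psi = w^i\, y^\dagger\bar\psi$, giving $w = 1$, and hence

$$|y^\dagger\bar\psi|^2 = \lim_{i\to\infty} y^\dagger R(i) y = y^\dagger R^\psi y$$

However, on applying (9.101),

$$|y^\dagger\bar\psi|^2 = y^\dagger R^\psi y = y^\dagger[\Sigma^\psi + \bar\psi\bar\psi^\intercal]y = y^\dagger\Sigma^\psi y + |y^\dagger\bar\psi|^2$$

This implies that $y^\dagger\Sigma^\psi y = 0$, violating the assumption that $\Sigma^\psi > 0$. $\square$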
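The Cauchy-Schwarz step can be probed numerically. The following is only a sketch and is not taken from the text: the three-state chain `P`, its stationary pmf `pi`, and the two-dimensional basis `psi` are hypothetical choices, and the ratio $|y^\dagger R(i)y| / y^\dagger R(0) y$ is evaluated at randomly drawn complex vectors $y$ for a few lags $i$.

```python
import random

# Hypothetical example: birth-death chain on {0, 1, 2} with self-loops (aperiodic).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]                        # stationary pmf: pi P = pi
psi = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]   # psi(x) in R^2, chosen for illustration


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def quad_form(y, Pi):
    """Return y^dagger R(i) y, where Pi is the i-step transition matrix P^i."""
    # a[x] = psi(x)^T y; since psi is real, y^dagger psi(x) = conj(a[x]).
    a = [sum(p * yk for p, yk in zip(psi[x], y)) for x in range(3)]
    return sum(pi[x] * Pi[x][xn] * a[xn].conjugate() * a[x]
               for x in range(3) for xn in range(3))


random.seed(0)
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
worst = 0.0
for _ in range(200):
    y = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2)]
    denom = quad_form(y, identity).real       # y^dagger R(0) y
    Pi = identity
    for i in range(1, 6):
        Pi = mat_mul(Pi, P)
        worst = max(worst, abs(quad_form(y, Pi)) / denom)

print("largest observed ratio:", worst)
```

Random sampling only gives a lower estimate of the maximum over $y$, so this is a sanity check of the inequality rather than a computation of the constant itself.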


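Proof of Thm. 9.7. Part (i) is from the definitions, since we have

$$\mathsf{E}_\pi[\mathcal{D}_{n+1}\psi_i(X(n))] = 0, \qquad 1 \le i \le d$$

Part (ii) requires interpretation of $\theta^*$ for TD(1). Lemma 9.9 tells us that $A = -R^\psi$ and (9.38)
then gives

$$-R^\psi\theta^* + \mathsf{E}_\pi[\zeta_n c(X(n))] = 0, \qquad \zeta_n = \sum_{i=0}^\infty \gamma^i\psi(X(n-i)) \tag{9.103}$$

To complete the proof we show that (9.103) coincides with the necessary and sufficient condition
for optimality of (9.33) with norm (9.34).

Let $\theta^\circ \in \mathbb{R}^d$ denote a solution to (9.33). Recalling that $\nabla_\theta h^\theta = \psi$, the first order condition for
optimality is expressed

$$
\begin{aligned}
0 = \tfrac{1}{2}\nabla_\theta\|h^\theta - h\|_\pi^2\,\Big|_{\theta=\theta^\circ}
&= \sum_{x\in\mathsf{X}} (h^\theta(x) - h(x))\,\nabla_\theta h^\theta(x)\,\pi(x)\,\Big|_{\theta=\theta^\circ} \\
&= R^\psi\theta^\circ - \mathsf{E}_\pi[h(X(n))\psi(X(n))]
\end{aligned}
$$

In view of (9.103), it remains to show that

$$\mathsf{E}_\pi[h(X(n))\psi(X(n))] = \mathsf{E}_\pi[\zeta_n c(X(n))] \tag{9.104}$$

This is obtained from the definition of the value function:

$$\mathsf{E}_\pi[h(X(n))\psi(X(n))] = \mathsf{E}_\pi\Bigl[\mathsf{E}\Bigl[\sum_{i=0}^\infty\gamma^i c(X(n+i)) \,\Big|\, X(n)\Bigr]\psi(X(n))\Bigr]$$

The identity (9.104) then follows from the smoothing property of conditional expectation:

$$
\begin{aligned}
\mathsf{E}_\pi[h(X(n))\psi(X(n))] &= \mathsf{E}_\pi\Bigl[\sum_{i=0}^\infty\gamma^i c(X(n+i))\psi(X(n))\Bigr] \\
&= \sum_{i=0}^\infty\gamma^i\,\mathsf{E}_\pi[c(X(n+i))\psi(X(n))] \\
&= \sum_{i=0}^\infty\gamma^i\,\mathsf{E}_\pi[c(X(n))\psi(X(n-i))] = \mathsf{E}_\pi[\zeta_n c(X(n))] \qquad\square
\end{aligned}
$$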


Proof of Prop. 9.8. We confront a notational clash in analysis of TD($\lambda$), since a focus is bounding
the eigenvalues of a matrix $A$. In the remainder of this section an eigenvalue-eigenvector pair is
denoted $(\eta, v)$ for which $v$ is not zero, and $Av = \eta v$.


Figure 9.7: Left: Eigenvalues of $A$ satisfy $\mathrm{Real}(\eta) \le -(1-\varrho_\lambda)z^0$. Right: eigenvalue expressed in terms of eigenvector
and autocorrelation sequence.


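Let $(\eta, v)$ denote any eigenvalue-eigenvector pair for $A$, satisfying $\|v\| = 1$. The eigenvector
equation $Av = \eta v$ together with Lemma 9.9 give the formula for $\eta$ shown on the right hand side of
Fig. 9.7.

Let $z^0 = v^\dagger R^\psi v > 0$ and $w = \eta + z^0 \in \mathbb{C}$, so that $\eta = -z^0 + w$. It remains to show that
$|w| \le \varrho_\lambda z^0$, so that $\eta$ lies within the closed disc shown in Fig. 9.7. This follows from Lemma 9.28:

$$|w| \le (\lambda^{-1} - 1)\sum_{i=1}^\infty (\gamma\lambda)^i\,|v^\dagger R(i)^\intercal v| \le \varrho\,[v^\dagger (R^\psi)^\intercal v]\,(\lambda^{-1}-1)\sum_{i=1}^\infty (\gamma\lambda)^i = \varrho_\lambda z^0 \qquad\square$$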


Proof of Prop. 9.12. Each part of the proof is based on properties of the function $p(\gamma) = \det(A_\gamma)$,
which requires an alternative representation of $A_\gamma$.
Following the proof of Lemma 9.9, we write $\zeta_n = \sum_{i=0}^\infty (\lambda\gamma)^i\psi_{(n-i)}$, and hence

$$A_\gamma = \sum_{i=0}^\infty (\lambda\gamma)^i\,\mathsf{E}_\pi\bigl[\psi_{(n-i)}\bigl(-\psi_{(n)} + \gamma\psi_{(n+1)}\bigr)^\intercal\bigr]$$

We always have $A_0 = -R^\psi$, so that $p(0) = \det(-R^\psi) \neq 0$ for any value of $\lambda$.
For $\lambda = 0$ this simplifies to

$$A_\gamma = -R^\psi + \gamma B, \qquad\text{with } B \mathbin{:=} \mathsf{E}_\pi[\psi(X(n))\psi(X(n+1))^\intercal]$$

Part (i) is established on recognizing that $p(\gamma)$ is a polynomial function of degree $d$ that is not
identically zero, and is therefore zero for at most $d$ values of $\gamma$.

The proof of (ii) is similar: in this case the representation for $A_\gamma$ implies that we can extend
the domain of $p$ to the interval $(-\varepsilon, 1+\varepsilon)$ on which $p$ is an analytic function of $\gamma$. It therefore can
have at most a finite number of zeros on the interval $[0,1] \subset (-\varepsilon, 1+\varepsilon)$ [304].

The proof of (iii) is the same as (ii) with one exception: the domain of $p$ can be extended to
define an analytic function on the smaller interval $(-\varepsilon, 1)$. Hence it has a finite number of roots on
each closed interval $[0, 1-\delta] \subset (-\varepsilon, 1)$. $\square$


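Proof of Lemma 9.13. Denote $W_\gamma = \gamma P - \delta\,\mathbf{1}\otimes\mu$.

For (i) first note that $\mathbf{1}$ remains a right eigenvector of the matrix $W_\gamma$, with eigenvalue $\gamma - \delta$. To
characterize the remaining eigenvalues of $W_\gamma$, observe that if $\eta \neq \gamma - \delta$ is an eigenvalue, then an
associated left eigenvector $v \in \mathbb{R}^d$ must be orthogonal to $\mathbf{1}$. Consequently,

$$\eta v^\intercal = v^\intercal(\gamma P - \delta\,\mathbf{1}\otimes\mu) = \gamma v^\intercal P$$

Hence $v$ is also a left eigenvector of $P$.
Part (ii) then follows from (i) since the eigenvalues of $W_\gamma$ lie within the open unit disc in $\mathbb{C}$.
For (iii) we obtain a representation of

$$[I - W_\gamma]^{-1} = (A + UV)^{-1}, \qquad\text{with } A = (I - \gamma P),\ U = \delta\mathbf{1},\ V = \mu^\intercal$$

The Matrix Inversion Lemma (A.1) gives

$$[I - W_\gamma]^{-1} = A^{-1} - A^{-1}U(1 + VA^{-1}U)^{-1}VA^{-1} = A^{-1} - \frac{1}{g}A^{-1}UVA^{-1}$$

where $g = 1 + VA^{-1}U = 1 + \langle\mu, \delta(I-\gamma P)^{-1}\mathbf{1}\rangle = 1 + \delta/(1-\gamma)$,

$$A^{-1}U = \delta(I-\gamma P)^{-1}\mathbf{1} = \frac{\delta}{1-\gamma}\mathbf{1}, \qquad VA^{-1} = \mu^\intercal(I-\gamma P)^{-1}$$

Substituting these identities gives (9.61):

$$[I - W_\gamma]^{-1} = (I-\gamma P)^{-1} - \frac{\delta}{1+\delta-\gamma}\,[\mathbf{1}\otimes\mu](I-\gamma P)^{-1}$$

and consequently

$$H = [I - W_\gamma]^{-1}c = (I-\gamma P)^{-1}c - \frac{\delta}{1+\delta-\gamma}\langle\mu, (I-\gamma P)^{-1}c\rangle = Q - \frac{\delta}{1+\delta-\gamma}\langle\mu, Q\rangle$$

This gives one representation of k, and the other is obtained on taking the mean of each side of
this expression with respect to $\mu$. $\square$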
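The final identity can be checked numerically. The sketch below is illustrative only: the three-state transition matrix `P`, cost `c`, probability vector `mu`, and the values of `gamma` and `delta` are hypothetical, and both linear systems are solved by successive approximation rather than by matrix inversion.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def fixed_point(M, c, iters=2000):
    """Solve h = c + M h by successive approximation (spectral radius of M below one)."""
    h = [0.0] * len(c)
    for _ in range(iters):
        Mh = mat_vec(M, h)
        h = [ci + mi for ci, mi in zip(c, Mh)]
    return h


gamma, delta = 0.9, 0.5                  # hypothetical parameter values
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.2, 0.0, 0.8]]
c = [1.0, 0.0, 2.0]
mu = [0.2, 0.3, 0.5]                     # pmf defining the rank-one matrix 1 (x) mu

n = len(c)
gamma_P = [[gamma * P[i][j] for j in range(n)] for i in range(n)]
W = [[gamma * P[i][j] - delta * mu[j] for j in range(n)] for i in range(n)]

Q = fixed_point(gamma_P, c)              # Q = (I - gamma P)^{-1} c
H = fixed_point(W, c)                    # H = (I - W)^{-1} c
mean_Q = sum(m * q for m, q in zip(mu, Q))
shift = delta * mean_Q / (1.0 + delta - gamma)

print([h - (q - shift) for h, q in zip(H, Q)])   # entries should be near zero
```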
9.10 Exercises


There are not many exercises in this chapter, and none in the next. What follows are intended 
to fill some theoretical gaps. 


9.1 Consider the M/M/1 queue with controlled transition matrix (7.30) and cost function $c(x,u) = x$
for all $x, u$. The relative value function was derived in Section 7.3, based on the fact that the
optimal policy is $\phi^*(x) = 1\{x \ge 1\}$ (it is “non-idling”). The policy remains non-idling for the
discounted-cost criterion, but the value function $h^*$ is no longer quadratic.

(a) Show that $h^*(x) = ax + b + \xi r^x$ for $x \in \mathsf{X}$, where $a, b, r$ and $\xi$ are constants. For this you
should solve the DP equation (9.6a) with the knowledge that $\phi^*$ is non-idling.


(b) Compute $Q^*$ using (9.70) together with your formula for $h^*$ in (a), and verify that $\phi^*$ is
obtained as its minimizer.


9.2 The approximation (9.100) begs the question: why approximate $-A(\theta_n)^{-1}A(\theta_n)\theta_n = -\theta_n$?
The following algorithm is designed so that $w_n \approx A(\theta_n)^{-1}b$, with $-\theta_n$ moved to the slow recursion:

$$
\begin{aligned}
\theta_{n+1} &= \theta_n + \alpha_{n+1}\{-\theta_n + w_n\} && (9.105a)\\
w_{n+1} &= w_n + \beta_{n+1}\{A_{n+1}w_n - b_{n+1}\} && (9.105b)
\end{aligned}
$$


(a) Show that for TD-learning, (9.2) is precisely PJR averaging when $\alpha_n = 1/n$.


(b) Compare (9.2) and (9.99) on a tabular Q-learning example, such as the six-state example
shown in Fig. 9.1. Obtain plots and histograms to compare both the transient behavior and the
asymptotic covariance. Review warnings at the close of Section 9.8.2 regarding the choice of $\{\beta_n\}$.


9.3 Q(0)-learning is defined by (9.72) using $\zeta_n = \nabla_\theta H^\theta(\Phi(n))$. Consider this algorithm with
linear function approximation, using the basis (5.11) defined by binning. Assume that the Markov
policy for exploration is chosen so that $\varpi(B_i) > 0$ for each bin $B_i$.


The algorithm is stable in this very special case through an extension of Prop. 9.15:


(a) Obtain the vector field for the ODE approximation. It will be similar to the form $\frac{d}{dt}\vartheta = \bar f(\vartheta)$
obtained in Prop. 9.15.


(b) Obtain the ODE approximation with matrix gain designed to approximate $[R^\psi]^{-1}$, which is
diagonal for this basis.


(c) Show that the function $V(\theta) = \|\tilde\theta\|_\infty$ used in Prop. 9.17 remains a Lyapunov function for
Q(0)-learning with this basis.


9.4 Find an example for which step 3 of the ODE method fails in the case of Q(0)-learning for
LQG. Take $\{H^\theta : \theta \in \mathbb{R}^d\}$ to be a linearly parameterized family of quadratic functions on $\mathsf{X} \times \mathsf{U}$,
and verify in your example that $H^\theta$ is not Lipschitz continuous as a function of $\theta$.


Propose a method to modify the algorithm so that the Lipschitz condition is satisfied. 


9.5 Shown below is an example from [355] for which TD-learning may be unstable: 


[Figure: transition diagram for the two-state example from [355]]


The setting is similar to Baird’s counterexample, illustrated in Fig. 5.5, except there are only two
states with $\mathsf{X} = \{1,2\}$. There is no control, and the dynamics are deterministic, with state 2
absorbing. The cost is zero, so that $h^*(x) = Q^*(x,u) = 0$ for all $x, u$.


We would like to estimate $h^*$ using TD-learning with $d = 1$ and $\psi(x) = 1 + 1\{x = 2\}$. Like Baird’s
example, this violates our convention that $\psi(x^e) = 0$, but we do have $h^* = h^{\theta^*}$ with $\theta^* = 0$.


Obtain a formula for the temporal difference, similar to (5.51), and perform the following tasks:


(a) Verify that TD(0) is not stable for some values of $\gamma \in [0,1]$ when “perfect exploration” is
adopted (for this you must review discussion surrounding (5.51)). Approach: The linearization
matrix $A$ is a scalar in this case. Show that $A > 0$ for some values of $\gamma$.


(b) Show that a less uniform sampling will lead to a stable algorithm: Choose $X$ i.i.d., with
$\varepsilon = \mathsf{P}\{X(n) = 1\}$ positive but small. This is similar to an “$\varepsilon$-greedy policy”, even though there is
no control.
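A quick numerical exploration may help before attempting the exercise. The sketch below rests on assumptions that are not in the text: the state is sampled i.i.d. with `p1` the probability of state 1 (so `p1 = 0.5` is used here as one possible reading of “perfect exploration”), and the scalar is taken to be $A = \mathsf{E}[\psi(X(n))\{-\psi(X(n)) + \gamma\psi(X(n+1))\}]$. The function name `td0_gain` is hypothetical.

```python
def td0_gain(gamma, p1):
    """Scalar A = E[psi(X) * (-psi(X) + gamma * psi(X_next))] for the two-state chain.

    p1 is the (assumed) probability of sampling X = 1. State 1 moves to
    state 2, and state 2 is absorbing.
    """
    psi = {1: 1.0, 2: 2.0}     # psi(x) = 1 + 1{x = 2}
    nxt = {1: 2, 2: 2}         # deterministic dynamics
    prob = {1: p1, 2: 1.0 - p1}
    return sum(prob[x] * psi[x] * (-psi[x] + gamma * psi[nxt[x]]) for x in (1, 2))


for gamma in (0.5, 0.8, 0.9):
    print(gamma, td0_gain(gamma, 0.5), td0_gain(gamma, 0.05))
```

A positive value of `td0_gain` indicates that the scalar linear ODE with gain $A$ is unstable, and a negative value indicates stability. How small `p1` must be for stability depends on $\gamma$.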


9.11 Notes 


Temporal difference methods As surveyed earlier in Section 5.9, the TD-learning algorithms 
developed by Sutton and Barto in the 1980s were designed to obtain approximations of value 
functions within a finite-dimensional parameterized class, with emphasis on both linear function 
classes and neural networks. Sutton’s dissertation contains early insights on temporal difference 
methods and some of the first TD algorithms [26, 339, 340] (see also Williams [376] for more early 
references). A fuller history of RL origins can be found in [338, 347]. 

The seeds planted by the early RL explorers prompted a flurry of analysis in the 1990s (much 
of it led by Tsitsiklis and his students at MIT), along with many new algorithms and analytical 
techniques; more on the contributions of the MIT school can be found in Section 10.10. 

The terminology split sampling is due to Borkar [66], but the use of multiple sampling in RL, 
such as in eq. (9.55), has a longer history in RL [21]. 

The interpretation Thm. 9.7 (i) can be found in [338, 347], and part (ii) is due to [356] (more 
history on minimum norm solutions can be found in Section 10.10). 

Prop. 9.12 is adapted from [187, Theorem 4.1]. 

Baird introduced the advantage function in [22], and soon after he proposed application to 
policy gradient methods in [23]. In Baird’s work and more recent research, estimation of the 
advantage function requires a parallel algorithm to estimate the value function. The algorithm 
(9.68) and refinements in Section 10.2 avoid parallel estimation of the value function without 
sacrificing accuracy—see Prop. 10.7. These algorithms and supporting theory appear to be new. 

Finite-n performance of single time-scale SA algorithms with application to TD-learning was 
studied in [851, 210], and bounds for two-time-scale SA algorithms were obtained in [101]. However, 
these works rest on a critical assumption: that the noise is a martingale difference sequence. 
Extension to Markovian noise presents a significant challenge [89, 59].


Q-learning Watkins’ algorithm was introduced in his dissertation [371], with further analysis 
following in [372]. It was soon understood that the ODE approximation is easily analyzed in this 
tabular setting through Lyapunov techniques, as seen in Prop. 9.17. Similar techniques are available 
to establish stability of Q-learning for optimal stopping with linear function approximation [353]. 

Prop. 9.20 and the variance analysis is taken from [114]. The paper [111] addresses high variance 
for large discounting by estimating the gradient of the value function (so the theory is applicable
only when X is Euclidean space).

Soon after stability was established, Szepesvari investigated the rate of convergence. Using a 
clever coupling argument introduced in [227], the following upper bound is obtained in [346] for 
Watkins algorithm with the state-dependent step-size a}: 


$$|H^{\theta_n}(x,u) - Q^*(x,u)| \le B\,\frac{1}{n^{(1-\gamma)r}}, \qquad n \ge 1,\ x \in \mathsf{X},\ u \in \mathsf{U}$$

with $B$ a constant and $r = \underline{\varpi}/\overline{\varpi}$ (the ratio of the minimum and maximum of $\varpi$). The bound is
only valid for sufficiently large $\gamma$ (it is not realistic to have $(1-\gamma)r > 1/2$). While only an upper
bound, this suggests that performance is very poor when the discount factor is close to unity. See
[124, 18] for extensions and refinements.

It was first established in [112, 113] that the convergence rate of the MSE $\mathsf{E}[\|\theta_n - \theta^*\|^2]$ of
Watkins’ Q-learning can be slower than $O(1/n^{2(1-\gamma)})$, if the discount factor satisfies $\gamma > \frac{1}{2}$. It was
also shown that the optimal convergence rate (8.8) is obtained by using a step-size of the form
Qn = g/n or A = gay, for g > 0 sufficiently large. 

Stability theory is not well developed outside of very special cases, such as the use of binning 
in Exercise 9.3. This exercise is inspired by Gordon [146] who describes this and other successful 
function approximation architectures for Q-learning. A generalization of binning called soft state 
aggregation was introduced in [324]. Stability theory for off-policy TD-learning faces similar chal- 
lenges as Q-learning [345, 249, 219]. Counterexamples show that conditions on the function class 
are required in general, even in a linear function approximation setting [21, 355, 341, 147]. 


GQ and Zap The GQ-Learning algorithm of Section 9.8.1 was introduced in [345] for linear 
function approximation (see also [235]). Convergence theory was extended to non-linear function 
approximation in [55]. 

The LSTD(A) algorithm (9.42) was introduced in [76] (see also [71, 271]). It is an instance of 
Stochastic Newton-Raphson, but the original motivation had nothing to do with minimizing the 
asymptotic covariance. 

Recall from Section 4.11 that the Newton-Raphson flow was introduced by Smale [325]. The 
Zap Q-learning algorithm was introduced in [112, 113] without knowledge of Smale’s theory—the 
matrix gain was motivated by minimizing the asymptotic covariance, rather than to create a general 
tool for the creation of consistent algorithms. 

The proof of convergence based on Fig. 9.4 first appeared in [112], and soon after it was realized 
that the ODE approximation (9.97) was valid outside of the tabular setting [90, 110, 107]. 

The discontinuity of the vector field $\bar f$ initially presented a challenge in analysis of Zap
Q-learning: it wasn’t obvious that the ODE approximation could be justified. In [90] this open
question was resolved by appealing to special structure in Q-learning. The theory in [90] is
completely general (applicable even to nonlinear function approximation architectures, such as neural
networks). A central idea can be explained within the context of the tabular setting. The matrix
$A_{n+1}$ appearing in (9.96) is a sub-gradient of $f_{n+1}$ at the value $\theta_n$, in the following sense:

$$\{f_{n+1}(\theta) - f_{n+1}(\theta_n)\}_i \le \sum_j A_{n+1}(i,j)\,[\theta(j) - \theta_n(j)], \qquad 1 \le i \le d,\ \theta \in \mathbb{R}^d$$

This holds for the algorithm (9.76) with linear function approximation, provided the basis vector
has non-negative entries, $\psi\colon \mathsf{X}\times\mathsf{U} \to \mathbb{R}_+^d$. It is not difficult to show that a similar inequality is
preserved in any ODE limit, which is all that is needed to show that $V(\theta) = \|\bar f(\theta)\|^2$ serves as a
Lyapunov function for the SA algorithm.

Relaxing the positivity assumption was a bigger challenge than confronting a nonlinear function 
approximation architecture. 

The article [90] also contains many numerical examples illustrating the application of Zap Q- 
learning with neural network function approximation. 


Convex Q-learning Extensions of the convex Q algorithms of Section 5.5 are described in
[230, 247]. The main challenge in algorithm design is that the constraints involve a conditional
expectation. For example (5.63) would be modified as follows:


max (u, H°) 
(9.106) 
s.t. H°(x,u) < c(x,u) + E[H?(®(k +1)) | ®(k) = (a, u)| rEX, we U(z) 


The conditional expectation can be approximated to obtain an algorithm, as discussed in
Section 9.2. The use of experience replay is another approach based on empirical distributions (see
[338, Ch. 16] and [211]). Theory for convex Q-learning remains immature, so it is best left to a sequel
or second edition.

The recent paper [218] is based on a variant of the convex program (9.106) in the tabular 
setting, and [224] contains an algorithm very similar to convex Q learning (see [224, eqn. (6)]). 
The more recent RL survey [269] has a version of convex Q learning, and explains the importance 
of regularization. Also related to (9.106) is the logistic Q-learning algorithm of [28], which may 
represent an opening for more practical algorithms, as well as more elegant theory.
Chapter 10


Setting the Stage, Return of the Actors


This chapter introduces techniques to improve TD-learning, along with algorithms for the average 
cost optimality criterion. There is an emphasis on geometry surrounding minimum norm problems, 
as first discussed in the paragraphs surrounding (6.22). Alternative proofs of Thms. 9.7 and 9.11 
that expose the underlying geometry will lead to new tools for algorithm design. 

You should be asking yourself, why do we care about solving the minimum norm problems posed 
in Thms. 9.7 and 9.11? In particular, what is your motivation to solve the optimization problem 
below? 

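$$\theta^* = \arg\min_\theta \|H^\theta - Q\|_\varpi^2, \qquad \|H^\theta - Q\|_\varpi^2 = \sum_{z\in\mathsf{Z}} (H^\theta(z) - Q(z))^2\,\varpi(z) \tag{10.1}$$

with $\mathsf{Z} = \mathsf{X} \times \mathsf{U}$. Mathematical elegance may provide ample motivation, but up to now there is no
evidence that this is a useful metric for success in control design.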

Put on your ‘control hat’, and look back to the Inverse Dynamic Programming (IDP)
discussion in Section 3.4. Our goal there was not to approximate a value function, but to ensure that
any approximating function $J$ we choose comes with a cost function $c^J$ with desirable properties:
namely, it is coercive and $c^J \approx c$. In the stochastic setting, the average cost criterion is most closely
related to the total cost setting of Section 3.4, as made precise through examples and some analysis
in Section 7.2. In particular, recall the approximation (7.26):


o° + J*(x) = min{e? (x, u) + Py J* (x)} 


in which J* denotes a value function for the fluid model, and c’ & c if and only if the Bellman 
error is small in the span seminorm. 


To understand the practical value of the optimization criterion (10.1) requires an entirely
different control-theoretic “wardrobe”, which is the topic of Section 10.5 and beyond. It is explained
there why the first half of this chapter sets the stage for actor-critic methods. The actors are
defined by a family of randomized policies $\{\check\phi^\theta : \theta \in \mathbb{R}^d\}$. The goal is to create an efficient algorithm
that estimates the “best” policy within this family, where the notion of “best” can be defined in
terms of discounted cost, total cost, or average cost.

We discover in Section 10.4.1 that most optimality criteria can be converted to average-cost 
through design of the framework for simulation or experimentation. For this reason, the theoretical
development of actor-critic methods focuses entirely on the average cost criterion. The relative
value function plays the role of critic, and theory surrounding TD(1) for average cost is used to 
construct stochastic gradient descent algorithms; these algorithms are designed to eliminate the 
bias that is inherent in the actor-only methods surveyed in Section 4.6. 

One conclusion from Section 10.6 is that actor-critic methods may be regarded as a disciplined 
approach to Q-learning. For comparison: 


▲ Q-learning seeks to solve the Galerkin relaxation (9.71), which is not easily justified outside
of the tabular setting.


▲ Actor-critic methods use (as a sub-routine) a variant of Q-learning, which is an essential
ingredient to obtain an unbiased gradient estimate.


10.1 The Stage, Projection and Adjoints 


The geometry developed in this chapter requires a linearly parameterized function class. If space
permitted, we might allow a RKHS for this purpose. Since this is not an option, a $d$-dimensional
basis $\psi\colon \mathsf{Z} \to \mathbb{R}^d$ is chosen, and any approximation is expressed $H^\theta = \theta^\intercal\psi$ with $\theta \in \mathbb{R}^d$.

Regardless of the interpretation of $Q$ in (10.1), the approximation $H^{\theta^*}$ is known as the projection
of $Q$ onto the linear subspace $\mathcal{H} = \{H^\theta : \theta \in \mathbb{R}^d\}$, and can be expressed

$$H^{\theta^*}(\Phi(k)) = \widehat{\mathsf{E}}[Q(\Phi(k)) \mid Y]$$

with $Y = \psi(\Phi(k))$ (recall (9.19) for the definitions). These interpretations were discussed in
Section 5.4.1, so we adopt the notation from Chapter 5: for any two functions $g, h\colon \mathsf{Z} \to \mathbb{R}$, the
inner product is defined by

$$\langle g, h\rangle_\varpi = \mathsf{E}_\varpi[g(\Phi(k))h(\Phi(k))] = \sum_{z\in\mathsf{Z}} g(z)h(z)\varpi(z)$$

so that

$$\|H^\theta - Q\|_\varpi^2 = \langle H^\theta - Q, H^\theta - Q\rangle_\varpi$$

Recall that we write $g \in L_2(\varpi)$ whenever $\|g\|_\varpi < \infty$. This comes for free since we assume in this
chapter that $\mathsf{X}$ and $\mathsf{U}$ are finite.
The conclusions of Props. 5.7 and 9.4 are restated here for ease of reference:


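Proposition 10.1. Suppose that $\{\psi_i\}$ are linearly independent in $L_2(\varpi)$: $\|\theta^\intercal\psi\|_\varpi = 0$ implies
that $\theta = 0$. Then, for any function $G \in L_2(\varpi)$ the projection exists, is unique, and given by
$\widehat G = \theta^{*\intercal}\psi$ with $\theta^* = R(0)^{-1}\bar\psi^G$, where $\bar\psi^G \in \mathbb{R}^d$ and the $d \times d$ matrix $R(0) = R^\psi$ are defined by

$$\bar\psi^G_i = \langle\psi_i, G\rangle_\varpi, \qquad R^\psi_{i,j} = \langle\psi_i, \psi_j\rangle_\varpi, \qquad 1 \le i,j \le d \tag{10.2}$$

$\square$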

Some mystery comes when $G$ is the Q-function. For the discounted cost criterion, the vector
$\bar\psi^Q \in \mathbb{R}^d$ can be expressed

$$\bar\psi^Q = \mathsf{E}_\varpi[\psi(\Phi(0))Q(\Phi(0))] = \sum_{k=0}^\infty \gamma^k\,\mathsf{E}_\varpi[\psi(\Phi(0))c(\Phi(k))] \tag{10.3}$$

where the second equation follows from the definition (9.6b) together with the smoothing
property of conditional expectation. Estimation of the right hand side using Monte-Carlo methods is
challenging because it involves the state-input trajectory over the infinite future.

The mystery is unmasked with the use of a bit of linear algebra. 


10.1.1 Linear operators and adjoints 

Let $T\colon L_2(\varpi) \to L_2(\varpi)$ be a linear operator. That is, for any $g, h \in L_2(\varpi)$ and scalars $\alpha, \beta \in \mathbb{R}$,

$$T(\alpha g + \beta h) = \alpha T(g) + \beta T(h)$$

It is customary to write $Tg$ rather than $T(g)$ when there is no risk of confusion.


For TD-learning with discounted-cost criterion, the linear operator of interest is defined by

$$T_\gamma g\,(z) \mathbin{:=} \sum_{k=0}^\infty \gamma^k\,\mathsf{E}[g(\Phi(k)) \mid \Phi(0) = z], \qquad\text{for any } g \in L_2(\varpi) \text{ and } z \in \mathsf{Z}, \tag{10.4}$$

so that $Q = T_\gamma c$. Basic theory of linear operators provides tools to efficiently estimate the vector
$\bar\psi^G$ appearing in (10.2) with $G = Q$.
The main concept is the adjoint of a linear operator $T$, denoted $T^\dagger$. This is defined by the
simple identity,

$$\langle Tg, h\rangle_\varpi = \langle g, T^\dagger h\rangle_\varpi, \qquad\text{for all } g, h \in L_2(\varpi) \tag{10.5}$$

In this finite setting there is a simple formula, obtained on expressing $T$ as a matrix:

$$Tg\,(z) = \sum_{z'\in\mathsf{Z}} T(z,z')g(z')$$


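Lemma 10.2. The adjoint of $T$ is equal to its transpose, followed by a similarity transformation:

$$T^\dagger(z',z) = \frac{1}{\varpi(z')}\,T(z,z')\,\varpi(z), \qquad z, z' \in \mathsf{Z}$$

Consequently, for any $h \in L_2(\varpi)$,

$$T^\dagger h\,(z') = \sum_{z\in\mathsf{Z}} T^\dagger(z',z)h(z) = \sum_{z\in\mathsf{Z}} T(z,z')\,\frac{\varpi(z)}{\varpi(z')}\,h(z), \qquad z' \in \mathsf{Z} \tag{10.6}$$


Proof. We have by definition,

$$\langle Tg, h\rangle_\varpi = \sum_{z\in\mathsf{Z}} \{Tg\,(z)\}h(z)\varpi(z) = \sum_{z\in\mathsf{Z}}\Bigl\{\sum_{z'\in\mathsf{Z}} T(z,z')g(z')\Bigr\}h(z)\varpi(z)$$

Reversing the order of summation and introducing $1 = \varpi(z')/\varpi(z')$ gives (10.6):

$$\langle Tg, h\rangle_\varpi = \sum_{z'\in\mathsf{Z}} g(z')\Bigl\{\sum_{z\in\mathsf{Z}} T(z,z')\,\frac{\varpi(z)}{\varpi(z')}\,h(z)\Bigr\}\varpi(z') \qquad\square$$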
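The matrix formula for the adjoint is easy to test on a small example. The sketch below is illustrative only: the pmf `pmf`, the random matrix `T`, and the helper names are hypothetical, and the check is that the two sides of the defining identity (10.5) agree.

```python
import random


def inner(g, h, pmf):
    """Weighted inner product <g, h> = sum_z g(z) h(z) pmf(z)."""
    return sum(gi * hi * pi for gi, hi, pi in zip(g, h, pmf))


def mat_vec(T, g):
    return [sum(t * x for t, x in zip(row, g)) for row in T]


def adjoint(T, pmf):
    """Adjoint matrix with entries pmf(z) * T(z, z') / pmf(z'), indexed as [z'][z]."""
    n = len(pmf)
    return [[pmf[z] * T[z][zp] / pmf[zp] for z in range(n)] for zp in range(n)]


random.seed(0)
n = 4
pmf = [0.1, 0.2, 0.3, 0.4]                # any pmf with positive entries
T = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
g = [random.uniform(-1, 1) for _ in range(n)]
h = [random.uniform(-1, 1) for _ in range(n)]

lhs = inner(mat_vec(T, g), h, pmf)                  # <T g, h>
rhs = inner(g, mat_vec(adjoint(T, pmf), h), pmf)    # <g, T_dagger h>
print(lhs, rhs)
```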


While (10.6) conforms with undergraduate linear algebra intuition, it is not obviously useful for 
our purposes.
10.1.2 Adjoints and eligibility vectors


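The probabilistic representation of $T_\gamma$ provides a more useful representation of its adjoint.


Proposition 10.3. The adjoint of $T_\gamma$ admits the representation, for any $h \in L_2(\varpi)$ and $z \in \mathsf{Z}$,

$$T_\gamma^\dagger h\,(z) = \sum_{k=0}^\infty \gamma^k\,\mathsf{E}[h(\Phi(-k)) \mid \Phi(0) = z]$$

where $\boldsymbol{\Phi}$ is the stationary process on the two-sided time interval.


Proof. Based on the definition (10.4) we have

$$\langle T_\gamma g, h\rangle_\varpi = \sum_{z\in\mathsf{Z}} \varpi(z)\Bigl(\sum_{k=0}^\infty\gamma^k\,\mathsf{E}[g(\Phi(k)) \mid \Phi(0) = z]\Bigr)h(z) = \sum_{k=0}^\infty\gamma^k\,\mathsf{E}_\varpi[h(\Phi(0))g(\Phi(k))]$$

We have by stationarity $\mathsf{E}_\varpi[h(\Phi(0))g(\Phi(k))] = \mathsf{E}_\varpi[h(\Phi(-k))g(\Phi(0))]$, so that

$$\langle T_\gamma g, h\rangle_\varpi = \sum_{k=0}^\infty \gamma^k\,\mathsf{E}_\varpi[h(\Phi(-k))\,g(\Phi(0))]$$

The proof is completed on applying the smoothing property:

$$\mathsf{E}_\varpi[h(\Phi(-k))g(\Phi(0))] = \mathsf{E}_\varpi\bigl[\mathsf{E}[h(\Phi(-k)) \mid \Phi(0)]\,g(\Phi(0))\bigr] = \sum_{z\in\mathsf{Z}} \mathsf{E}[h(\Phi(-k)) \mid \Phi(0) = z]\,g(z)\varpi(z) \qquad\square$$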
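One way to read the proposition is that the adjoint is the discounted resolvent of the time-reversed chain. The sketch below checks this reading numerically under assumptions that are not in the text: a hypothetical non-reversible three-state chain `P` with uniform stationary pmf `varpi`, and the reversed kernel `P_rev(z, z') = varpi(z') P(z', z) / varpi(z)`.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def resolvent(M, g, gamma, iters=500):
    """Return sum_k gamma^k M^k g, computed as the fixed point of f = g + gamma M f."""
    f = list(g)
    for _ in range(iters):
        Mf = mat_vec(M, f)
        f = [gi + gamma * mi for gi, mi in zip(g, Mf)]
    return f


def inner(g, h, pmf):
    return sum(gi * hi * pi for gi, hi, pi in zip(g, h, pmf))


# Hypothetical doubly stochastic (hence uniform stationary pmf), non-reversible chain.
P = [[0.1, 0.9, 0.0],
     [0.0, 0.1, 0.9],
     [0.9, 0.0, 0.1]]
varpi = [1.0 / 3.0] * 3
n = 3

# Transition matrix of the stationary chain run backwards in time.
P_rev = [[varpi[zp] * P[zp][z] / varpi[z] for zp in range(n)] for z in range(n)]

gamma = 0.9
g = [1.0, -2.0, 0.5]
h = [0.3, 4.0, -1.0]

lhs = inner(resolvent(P, g, gamma), h, varpi)       # <T_gamma g, h>
rhs = inner(g, resolvent(P_rev, h, gamma), varpi)   # <g, T_gamma_dagger h>
print(lhs, rhs)
```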


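Thm. 9.11 (ii) Revisited This result follows from Prop. 10.1 with only notational changes.
The vector $\bar\psi^Q$ in (10.3) has components

$$\bar\psi^Q_i = \mathsf{E}_\varpi[\psi_i(\Phi(0))Q(\Phi(0))] = \langle T_\gamma c, \psi_i\rangle_\varpi = \langle c, T_\gamma^\dagger\psi_i\rangle_\varpi, \qquad 1 \le i \le d \tag{10.7}$$

Prop. 10.3 provides a more familiar formula: for any $n$,

$$\bar\psi^Q = \mathsf{E}_\varpi[c_n\zeta_n], \qquad\text{with } c_n = c(\Phi(n)) \text{ and } \zeta_n = \sum_{k=0}^\infty \gamma^k\psi(\Phi(n-k)) \tag{10.8}$$

with $\boldsymbol{\Phi}$ stationary, and defined on the two-sided time axis. The sequence $\{\zeta_n : n \in \mathbb{Z}\}$ is the
stationary version of the eligibility vectors for the TD(1) algorithm. Hence, the formula
$\theta^* = R(0)^{-1}\bar\psi^Q$ given in Prop. 10.1 corresponds to the equilibrium condition for TD(1):

$$0 = \bar f(\theta^*) = -R^\psi\theta^* + \mathsf{E}_\varpi[c_n\zeta_n]$$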


10.1.3 Weighted Norms and Weighted Eligibility


The use of state weighting was introduced in Section 6.7.5 as a means to reduce variance in
algorithms based on Monte-Carlo methods. Let $w\colon \mathsf{Z} \to \mathbb{R}_+$ denote the weighting function used to
define the norm

$$\|H\|_{\varpi,w}^2 \mathbin{:=} \mathsf{E}_\varpi[H(\Phi(n))^2\,w(\Phi(n))] \tag{10.9}$$

We write $H \in L_2(\varpi,w)$ if this is finite (a vacuous assumption when $\mathsf{Z}$ is finite, as assumed in the
theoretical development here). The inner product is redefined consistently:

$$\langle G, H\rangle_{\varpi,w} = \mathsf{E}_\varpi[G(\Phi(n))H(\Phi(n))w(\Phi(n))], \qquad G, H \in L_2(\varpi,w)$$

The adaptation of (6.49) to the current setting is minimization of the weighted mean-square error:

$$\|H^\theta - Q\|_{\varpi,w}^2 = \mathsf{E}_\varpi\bigl[\bigl(H^\theta(\Phi(n)) - Q(\Phi(n))\bigr)^2 w(\Phi(n))\bigr] \tag{10.10}$$

The Galerkin approach to on-policy TD($\lambda$)-learning is modified analogously:

$$0 = \mathsf{E}_\varpi\bigl[\{-H^\theta(\Phi(n)) + c_n + \gamma H^\theta(\Phi(n+1))\}\,w(\Phi(n))\,\zeta_n\bigr] \tag{10.11}$$

with $\zeta_n = \sum_{k=0}^\infty (\lambda\gamma)^k\psi(\Phi(n-k))$, and with $\boldsymbol{\Phi}$ stationary on $\mathbb{Z}$.
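The development of Section 10.1 carries over to the new vector space $L_2(\varpi,w)$, beginning with
the following:


Proposition 10.4. The following hold for any function $G \in L_2(\varpi,w)$, with the notation:

$$
\begin{aligned}
\bar\psi^G_i &= \langle\psi_i, G\rangle_{\varpi,w}, && 1 \le i \le d \\
R^\psi_{i,j} &= \langle\psi_i, \psi_j\rangle_{\varpi,w}, && 1 \le i,j \le d
\end{aligned}
\tag{10.12}
$$

(i) The projection $\widehat G$ exists, given by $\widehat G = \theta^{*\intercal}\psi$, with $\theta^* \in \mathbb{R}^d$ any solution to the linear
equation $R^\psi\theta^* = \bar\psi^G$.

(ii) Suppose that $\{\psi_i\}$ are linearly independent in $L_2(\varpi,w)$: $\|\theta^\intercal\psi\|_{\varpi,w} = 0$ only when $\theta = 0$.
Then, $\theta^* = [R^\psi]^{-1}\bar\psi^G$ is unique. $\square$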


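When applied to minimize the objective (10.10) we still have $Q = T_\gamma c$, and hence

$$\bar\psi^Q_i = \langle\psi_i, T_\gamma c\rangle_{\varpi,w} = \langle T_\gamma^\dagger\psi_i, c\rangle_{\varpi,w}, \qquad 1 \le i \le d$$

However, the definition of the adjoint depends on the choice of norm:
Lemma 10.5. We have for all $g, h \in L_2(\varpi,w)$,

$$\langle T_\gamma g, h\rangle_{\varpi,w} = \sum_{k=0}^\infty \gamma^k\,\mathsf{E}_\varpi[w(\Phi(0))h(\Phi(0))g(\Phi(k))] = \sum_{k=0}^\infty \gamma^k\,\mathsf{E}_\varpi[w(\Phi(-k))h(\Phi(-k))g(\Phi(0))]$$

Consequently, the adjoint of $T_\gamma$ in $L_2(\varpi,w)$ admits the representation,

$$T_\gamma^\dagger h\,(z) = \frac{1}{w(z)}\sum_{k=0}^\infty \gamma^k\,\mathsf{E}[w(\Phi(-k))h(\Phi(-k)) \mid \Phi(0) = z], \qquad z \in \mathsf{Z} \qquad\square$$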


This motivates a new algorithm and consistency result: 




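LSTD(1) with weighting (on-policy)


With initialization $\zeta_0 \in \mathbb{R}^d$, $\Sigma_0 \in \mathbb{R}^{d\times d}$ (positive definite), and time horizon $N$,

$$
\begin{aligned}
\theta_N &= \widehat\Sigma_N^{-1}\,\bar\psi^Q_N && (10.13a)\\
\text{with}\quad \widehat\Sigma_N &= \frac{1}{N}\Bigl(\Sigma_0 + \sum_{n=1}^N w_n\psi_{(n)}\psi_{(n)}^\intercal\Bigr) && (10.13b)\\
\bar\psi^Q_N &= \frac{1}{N}\sum_{n=1}^N c_n\zeta_n && (10.13c)\\
\zeta_n &= \gamma\zeta_{n-1} + w_n\psi_{(n)}, \qquad w_n = w(\Phi(n)), \qquad 1 \le n \le N && (10.13d)
\end{aligned}
$$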

The law of large numbers then gives
Proposition 10.6. Under the linear independence assumption of Prop. 10.4 the LSTD(1)
algorithm (10.13) is consistent: $\lim_{N\to\infty}\theta_N = R(0)^{-1}\bar\psi^Q$ with probability one, where $R(0) = R^\psi$ and
$\bar\psi^Q$ are defined in (10.12) with $G = Q$ equal to the fixed policy Q-function (9.47). Consequently, the limit $\theta^*$
minimizes the $L_2$ objective (10.10). $\square$
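The recursion can be sketched in a few lines. Everything specific below is an assumption made for illustration rather than something taken from the text: the three-state chain `P` standing in for the state-input process, the `cost`, `weight` and `psi` tables, the choice of the identity for the initial matrix, and the function name `lstd1_weighted`. The example is arranged so that $Q$ lies in the span of the basis, in which case the target is $\theta^* = (2, 4/3)$ for any positive weighting.

```python
import random

gamma = 0.5
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]                        # hypothetical chain on Z = {0, 1, 2}
cost = [0.0, 1.0, 2.0]
weight = [1.0, 2.0, 1.0]                     # weighting function w(z) > 0
psi = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]  # basis psi(z) in R^2


def lstd1_weighted(N, seed=1):
    """Run the weighted LSTD(1) recursion for N steps and return the estimate."""
    rng = random.Random(seed)
    z = 1
    zeta = [0.0, 0.0]                        # eligibility vector, zeta_0 = 0
    Sigma = [[1.0, 0.0], [0.0, 1.0]]         # running sum, initialized at the identity
    b = [0.0, 0.0]                           # running sum of c_n * zeta_n
    for _ in range(N):
        z = rng.choices(range(3), weights=P[z])[0]
        w_n, psi_n = weight[z], psi[z]
        zeta = [gamma * zeta[i] + w_n * psi_n[i] for i in range(2)]
        for i in range(2):
            b[i] += cost[z] * zeta[i]
            for j in range(2):
                Sigma[i][j] += w_n * psi_n[i] * psi_n[j]
    # The 1/N factors cancel in Sigma^{-1} b; solve the 2x2 system by Cramer's rule.
    det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
    return [(b[0] * Sigma[1][1] - Sigma[0][1] * b[1]) / det,
            (Sigma[0][0] * b[1] - Sigma[1][0] * b[0]) / det]


theta = lstd1_weighted(20000)
print(theta)
```

The estimate is random, so agreement with the target is only approximate for any finite horizon.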


10.2 Advantage and Innovation 


Recall from Section 9.1.3 one motivation for the advantage function: in policy iteration we don’t
need a precise estimate of $Q$, since we are only interested in computing its minimum over $u$ for
each $x$. Instead of $Q$ we seek to estimate the difference $Q - G$ where $G\colon \mathsf{X} \to \mathbb{R}$ (the function $G$
does not depend upon $u$). Prop. 9.14 tells us that the best choice is $G^* = \underline{Q} = h$.

By “best” we mean that $h$ is the minimal MSE estimate of $Q$ over all functions on $\mathsf{X}$.
Consequently, the error $V = Q - h$ is orthogonal to any function on $\mathsf{X}$. In statistics, it is common to
call $V$ the innovations associated with the approximation of $Q$ by $h$. This interpretation of the
advantage function is tremendously useful for approximation of both $Q$ and $V$, as it inspires better
approximation architectures.

Throughout this section we remain in the discounted-cost setting, so that h and Q are defined 
via (9.47). Extension of theory and algorithms to average cost is taken for granted in the remainder 
of the chapter. 


10.2.1 Projection of advantage and value 


Let’s step back and reconsider estimation of Q within a function class defined by a given basis 
w: X x U— R¢, expanding the function class in one of two forms: 


H = {OW +E :0,EER} or H* ={O+g9:0ER*, g:X>R} (10.14) 


where w and w were introduced in (9.66), followed by the function class H. We have Hc H* , 
where the latter contains every function that depends only on x ∈ X. In particular, h ∈ H*.
Here we also require the two d-dimensional function classes
H={0TW:9ER} and H={OY:0ER4 


The interpretation W(X (k)) = E[w(®(k)) | X(&)] implies orthogonality of these two function classes, 
and much more. 

Denote the projections of Q onto the function spaces {H,H, H} by, respectively, Q, Q- and 
On. The projections of h and V = Q — h are denoted similarly. 


Proposition 10.7. (i) AnyGe H. is orthogonal to any function g: X + R: 
(G,9)@ = E[G(®(k))9(X(k))] = 0 


Consequently, H. and H. are orthogonal in Lo(@). 
(ii) For each g CH andGeH, 


IG +9 — Ql = 1G - V5 + Ig — Alle (10.15) 
(iii) V=V~ =Q~ andh=h =Q-. 
(iv) Q=V+h 
(v) Qx =V +h is the projection of Q onto H* Oo 


It is part (iii) that justifies the TD(A) algorithm (9.68). The proof of Thm. 10.8 is simply a 
restatement of Thm. 9.11 with the new basis. 


Theorem 10.8. The algorithm (9.68) is consistent: the estimates converge to the parameter w* 
that defines the projection V = {w*}Tw°. O 


10.2.2 Weighted norm 


If we seek an approximation in a weighted norm, then a convenient choice for $G$ is the solution to
the $L_2$ optimization problem,

$$G^w = \arg\min_G \|Q - G\|_{\varpi,w}^2 = \arg\min_G \mathsf{E}_\varpi[\{Q(\Phi(n)) - G(X(n))\}^2\,w(\Phi(n))]$$

where the minimum is over all $G\colon \mathsf{X} \to \mathbb{R}$. The optimizer is characterized by orthogonality:

$$\langle Q - G^w, G\rangle_{\varpi,w} = 0 \qquad\text{for all } G\colon \mathsf{X} \to \mathbb{R}$$

The proof of Prop. 10.9 is obtained on setting $G = G^i$, where $G^i(x) = 1\{x = x^i\}$ for each $i$, where
$\{x^i\}$ is an enumeration of the state space $\mathsf{X}$.


Proposition 10.9. The optimizer is given by

$$G^w(x) = \frac{1}{\bar w(x)}\sum_u \check\phi(u \mid x)\,Q(x,u)\,w(x,u), \qquad \bar w(x) = \sum_u \check\phi(u \mid x)\,w(x,u), \qquad x \in \mathsf{X}$$

If $w$ does not depend upon $u$ this gives

$$G^w(x) = \underline{Q}(x) = \sum_u \check\phi(u \mid x)\,Q(x,u), \qquad x \in \mathsf{X}$$

Hence, in this case, $G^w = \underline{Q} = h$, so that $V = Q - h$ exactly as in (9.65a). $\square$
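The formula for the optimizer can be compared against the weighted mean-square objective directly. The sketch below uses hypothetical data on a two-state, two-input model: the marginal `pi`, the randomized policy table `policy`, the function `Q`, and the weighting `w` are all made up for illustration, with the steady-state pmf of the pair process taken to be `pi[x] * policy[x][u]`.

```python
# Hypothetical data on X = {0, 1}, U = {0, 1}.
pi = [0.4, 0.6]                      # marginal pmf of X(n)
policy = [[0.7, 0.3], [0.2, 0.8]]    # policy[x][u] = probability of input u in state x
Q = [[1.0, 3.0], [2.0, -1.0]]
w = [[1.0, 4.0], [0.5, 2.0]]         # weighting w(x, u) > 0


def objective(G):
    """Weighted mean-square error E[(Q(Phi) - G(X))^2 w(Phi)]."""
    return sum(pi[x] * policy[x][u] * w[x][u] * (Q[x][u] - G[x]) ** 2
               for x in range(2) for u in range(2))


def optimizer(weights):
    """Weighted conditional average of Q over u, for each state x."""
    G = []
    for x in range(2):
        norm = sum(policy[x][u] * weights[x][u] for u in range(2))
        G.append(sum(policy[x][u] * weights[x][u] * Q[x][u] for u in range(2)) / norm)
    return G


G_w = optimizer(w)
G_flat = optimizer([[2.0, 2.0], [5.0, 5.0]])   # weighting that does not depend on u
print(G_w, objective(G_w))
print(G_flat)
```

When the weighting does not depend on the input, `G_flat` reduces to the plain policy average of `Q` in each state, as in the second display of the proposition.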


The proposition tells us that the choice of basis w remains valid for approximating the advantage 
function, provided w does not depend upon u. The law of large numbers motivates an algorithm 
as seen in Prop. 10.6. Recall Section 9.5.4 for explanation of the basis w°. 

LSTD(1) for advantage with weighting (on-policy) 


With weighting function w: X — (0,00), initialization ¢) € R”, Fo € R™*™, and time horizon N, 


wy = Sx be (10.16a) 
with Sn= ~ (So a > wnid(ny Play ) (10.16b) 
n=1 
ae 3 it (10.16c) 
N n=1 


$$
\zeta_n = \gamma\zeta_{n-1} + w_n\psi^\circ_{(n)}, \qquad w_n = w(X(n)), \quad \psi^\circ_{(n)} = \psi^\circ(\Phi(n)), \qquad 1\le n\le N \tag{10.16d}
$$


10.3. Regeneration 


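The representation (6.26) for Poisson's equation admits a partial generalization to the discounted
cost criterion. Let $z^\bullet = (x^\bullet, u^\bullet) \in \mathsf{X}\times\mathsf{U}$ denote any state with positive steady-state probability:
$\varpi(z^\bullet) > 0$. Consider the function introduced before (9.57):

$$
Q_\gamma(z) = \mathsf{E}\Bigl[\sum_{k=0}^{\tau_\bullet-1}\gamma^k c(\Phi(k)) \,\Big|\, \Phi(0) = z\Bigr]
 + \mathsf{E}\Bigl[\gamma^{\tau_\bullet}\sum_{k=0}^{\infty}\gamma^k c(\Phi(k+\tau_\bullet)) \,\Big|\, \Phi(0) = z\Bigr]
$$

Apply the Strong Markov Property (see Appendix A.2.3) to obtain

$$
\mathsf{E}\Bigl[\gamma^{\tau_\bullet}\sum_{k=0}^{\infty}\gamma^k c(\Phi(k+\tau_\bullet)) \,\Big|\, \Phi(0) = z\Bigr]
 = \mathsf{E}\bigl[\gamma^{\tau_\bullet} Q_\gamma(\Phi(\tau_\bullet)) \mid \Phi(0) = z\bigr]
 = Q_\gamma(z^\bullet)\,\mathsf{E}\bigl[\gamma^{\tau_\bullet} \mid \Phi(0) = z\bigr]
$$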
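A similar representation holds for $\tilde Q_\gamma$:


Lemma 10.10. The following hold for $\gamma \in [0,1)$:

$$
Q_\gamma(z) = \mathsf{E}\Bigl[\sum_{k=0}^{\tau_\bullet-1}\gamma^k c(\Phi(k)) \,\Big|\, \Phi(0) = z\Bigr] + Q_\gamma(z^\bullet)\,\mathsf{E}\bigl[\gamma^{\tau_\bullet} \mid \Phi(0) = z\bigr]
$$

$$
\tilde Q_\gamma(z) = \mathsf{E}\Bigl[\sum_{k=0}^{\tau_\bullet-1}\gamma^k \tilde c(\Phi(k)) \,\Big|\, \Phi(0) = z\Bigr] + \tilde Q_\gamma(z^\bullet)\,\mathsf{E}\bigl[\gamma^{\tau_\bullet} \mid \Phi(0) = z\bigr]
$$

Consequently, $\lim_{\gamma\uparrow 1}\tilde Q_\gamma(z) = H^\bullet(z) + \lim_{\gamma\uparrow 1}\tilde Q_\gamma(z^\bullet)$, where $H^\bullet$ is the solution to Poisson's equation,

$$
H^\bullet(z) = \mathsf{E}\Bigl[\sum_{k=0}^{\tau_\bullet-1}\tilde c(\Phi(k)) \,\Big|\, \Phi(0) = z\Bigr] \tag{10.17}
$$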


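For $\gamma \approx 1$ we might opt to estimate the finite-horizon objective to obtain an algorithm with
reduced variance. Consider for fixed policy $\check\phi$,

$$
J_\gamma(z) = \mathsf{E}\Bigl[\sum_{k=0}^{\tau_\bullet-1}\gamma^k c(\Phi(k)) \,\Big|\, \Phi(0) = z\Bigr]
$$

On writing $J_\gamma(z) = c(z) + \mathsf{E}\bigl[\mathbf{1}\{\tau_\bullet \ge 2\}\sum_{k=1}^{\tau_\bullet-1}\gamma^k c(\Phi(k)) \mid \Phi(0) = z\bigr]$, the following DP equation is
obtained:

$$
\begin{aligned}
J_\gamma(z) &= c(z) + \gamma\sum_{z'\neq z^\bullet} P(z,z')\,J_\gamma(z') \\
 &= \mathsf{E}\bigl[c_n + \gamma\mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,J_\gamma(\Phi(n+1)) \mid \Phi(n) = z\bigr]
\end{aligned}
\tag{10.18}
$$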
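The DP equation (10.18) can be solved by successive approximation when the state space is small. The following is a minimal sketch under assumed data: the two-state transition matrix `P`, the cost `c`, and the choice of state 0 as the regeneration state are all hypothetical. It also illustrates that the equation remains well posed with $\gamma = 1$, because the sum excludes the regeneration state.

```python
# Sketch: successive approximation for the DP equation (10.18) on a toy chain.
# The chain, cost, and regeneration state are illustrative assumptions.

def solve_J(P, c, z_reg, gamma, iters=500):
    """Iterate J <- c + gamma * sum_{z' != z_reg} P[z][z'] J[z']."""
    n = len(c)
    J = [0.0] * n
    for _ in range(iters):
        J = [
            c[z] + gamma * sum(P[z][zp] * J[zp] for zp in range(n) if zp != z_reg)
            for z in range(n)
        ]
    return J

P = [[0.5, 0.5],
     [0.5, 0.5]]
c = [1.0, 2.0]
J = solve_J(P, c, z_reg=0, gamma=1.0)
```

For this chain the fixed point can be found by hand: $J(1) = 2 + \tfrac12 J(1)$ gives $J(1) = 4$, and then $J(0) = 1 + \tfrac12 J(1) = 3$.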


The finite-horizon objective invites a new eligibility vector: for $n \ge 0$,

$$
\zeta_n = \sum_{\sigma_\bullet^{[n]} \le k \le n} (\lambda\gamma)^{n-k}\,\psi_{(k)} \tag{10.19a}
$$
$$
\text{where}\quad \sigma_\bullet^{[n]} = \max\{k \le n : \Phi(k) = z^\bullet\}. \tag{10.19b}
$$

In the special case $\Phi(k) \neq z^\bullet$ for $k = 0,\dots,n$ (i.e., the maximum in (10.19b) is over an empty set),
we define $\sigma_\bullet^{[n]} = 0$. The sequence of eligibility vectors has a recursive form, which together with
(10.18) motivates the regenerative TD($\lambda$) algorithm:


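Regenerative TD($\lambda$) algorithm (on-policy)
For initialization $\theta_0, \zeta_0 \in \mathbb{R}^d$, the sequence of estimates is defined recursively:

$$
\begin{aligned}
\theta_{n+1} &= \theta_n + \alpha_{n+1}\zeta_n\mathcal{D}_{n+1} \\
\mathcal{D}_{n+1} &= \bigl(-H^\theta(\Phi(n)) + c_n + \gamma\mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,H^\theta(\Phi(n+1))\bigr)\Big|_{\theta=\theta_n} \\
\zeta_{n+1} &= \lambda\gamma\mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,\zeta_n + \psi_{(n+1)}, \qquad n \ge 0
\end{aligned}
\tag{10.20}
$$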
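The eligibility recursion in (10.20) should reproduce the sum (10.19a) along any sample path. The sketch below compares the two on a hand-picked path; the scalar basis `psi`, the path, and the convention $\zeta_0 = \psi_{(0)}$ are assumptions made for illustration.

```python
# Sketch: the regenerative eligibility vector computed two ways.
# Assumptions: scalar basis, hand-picked path, zeta_0 = psi(path[0]).

def psi(z):
    return float(z + 1)

def zeta_closed_form(path, z_reg, lam, gamma, n):
    """(10.19a): sum of (lam*gamma)^(n-k) psi(path[k]) over sigma <= k <= n."""
    visits = [k for k in range(n + 1) if path[k] == z_reg]
    sigma = max(visits) if visits else 0      # convention when z_reg has not been visited
    return sum((lam * gamma) ** (n - k) * psi(path[k]) for k in range(sigma, n + 1))

def zeta_recursive(path, z_reg, lam, gamma):
    """zeta_{n+1} = lam*gamma*1{path[n+1] != z_reg}*zeta_n + psi(path[n+1])."""
    zetas = [psi(path[0])]
    for nxt in path[1:]:
        keep = 0.0 if nxt == z_reg else 1.0   # the indicator resets the memory
        zetas.append(lam * gamma * keep * zetas[-1] + psi(nxt))
    return zetas

path = [2, 1, 0, 2, 2, 0, 1]                  # state 0 plays the role of z^bullet
rec = zeta_recursive(path, z_reg=0, lam=0.9, gamma=0.95)
direct = [zeta_closed_form(path, 0, 0.9, 0.95, n) for n in range(len(path))]
```

At each visit to the regeneration state (times 2 and 5 on this path) the eligibility vector restarts from $\psi(z^\bullet)$, which is what keeps it bounded even when $\lambda\gamma = 1$.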


With the introduction of regeneration, the algorithm remains practical even when $\gamma = \lambda = 1$.
For a linear function approximation this is a linear SA recursion $\theta_{n+1} = \theta_n + \alpha_{n+1}[A_{n+1}\theta_n - b_{n+1}]$
with

$$
A_{n+1} = \zeta_n\bigl[-\psi_{(n)} + \gamma\mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,\psi_{(n+1)}\bigr]^{\intercal}, \qquad
b_{n+1} = -\zeta_n c_n \tag{10.21}
$$

The value of $\lambda = 1$ is explained in Thm. 10.11, whose proof can be found in Section 10.9.


Pre-publication draft -- March 25, 2022 


CHAPTER 10. SETTING THE STAGE, RETURN OF THE ACTORS 368 


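Theorem 10.11. ($L_2$ Optimality of TD(1)) Consider the algorithm (10.20) with linear
function approximation $H^\theta = \theta^{\intercal}\psi$. Assume that $\Phi$ is uni-chain and that $\varpi(z^\bullet) > 0$.

Then, in the special case $\lambda = 1$,

(i) $A \overset{\text{def}}{=} \mathsf{E}_\varpi[A_n] = -R(0)$ and $b \overset{\text{def}}{=} \mathsf{E}_\varpi[b_n] = -\mathsf{E}_\varpi[J_\gamma(\Phi(n))\psi(\Phi(n))]$.

(ii) Any solution to $0 = \bar f(\theta^*) = \mathsf{E}_\varpi[\zeta_n\mathcal{D}_{n+1}]$ solves the minimum norm problem:

$$
\theta^* \in \arg\min_\theta \|H^\theta - J_\gamma\|^2_\varpi \overset{\text{def}}{=} \arg\min_\theta \mathsf{E}_\varpi\bigl[(H^\theta(\Phi(n)) - J_\gamma(\Phi(n)))^2\bigr] \tag{10.22}
$$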


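As in all TD(1) algorithms with linear function approximation, the characterization of $A$
strongly motivates the use of LSTD(1). One formulation is given by

$$
\theta_N^{\text{LSTD}} = \hat A_N^{-1}\hat b_N , \tag{10.23}
$$

where $N$ denotes the time horizon, and for given $R_0 > 0$,

$$
\hat A_N = -\frac{1}{N}\Bigl\{R_0 + \sum_{n=1}^{N}\psi_{(n)}\psi_{(n)}^{\intercal}\Bigr\}, \qquad
\hat b_N = -\frac{1}{N}\sum_{n=1}^{N}\zeta_n c_n
$$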


10.4 Average Cost and Every Other Criterion 


10.4.1 Every other criterion 


When we optimize over a family of policies, as described at the start of this chapter, we cannot
expect to find a single policy in the family that minimizes the discounted or total cost criterion
from each initial condition. It is customary to instead choose a pmf $\mu$ on $\mathsf{Z} = \mathsf{X}\times\mathsf{U}$, and define
the objective for optimization as follows:

$$
\Gamma(\check\phi) = \sum_z h_{\check\phi}(z)\,\mu(z)
$$

where $h_{\check\phi}$ is the value function associated with the policy $\check\phi$ for the chosen optimality criterion. In
Section 10.5 the family is defined through a finite dimensional parameterization $\{\check\phi^\theta : \theta \in \mathbb{R}^d\}$, and
we write $\Gamma(\theta)$ rather than $\Gamma(\check\phi^\theta)$.

In the following we show how to translate from one optimality criterion to another. Translation
is accomplished by the creation of a Markov chain $\Psi$, a strictly increasing sequence of times
$\{N_n : n \ge 1\}$, and a modified cost function $\check c$, all designed so that the partial sums

$$
S_n = \sum_{k=N_n}^{N_{n+1}-1}\check c(\Psi(k)), \qquad n \ge 1 \tag{10.24}
$$

are i.i.d., with common mean $\Gamma(\check\phi)$. It will follow by construction that $\Gamma(\check\phi)$ is proportional to
average cost for the newly constructed stochastic process.
There is no need to identify which policy is under consideration in the constructions that follow,
so we write $\Gamma$ instead of $\Gamma(\check\phi)$, and let $T$ denote the transition matrix for the Markov chain $\Phi$.

Discounted cost. Take $h$ equal to the discounted-cost value function obtained with a policy $\check\phi$,
and denote

$$
\Gamma = \sum_z h(z)\,\mu(z) = \mathsf{E}\Bigl[\sum_{k=0}^{\infty}\gamma^k c(\Phi(k))\Bigr], \qquad \Phi(0)\sim\mu \tag{10.25}
$$

The construction is defined through regeneration: $\Psi(k) = (\tilde\Phi(k), B(k))$ in which $B(k) \in \{0,1\}$ for
each $k$, with $B(k) = 1$ indicating that a regeneration occurred at time $k$, and $\tilde\Phi(k)$ evolves according
to the transition matrix $T$ in-between regeneration times.

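We begin with the construction of the first regeneration time along with an alternative
representation of (10.25). Let $B$ be a Bernoulli process with parameter $1-\gamma$, independent of $\Phi$:

$$
\mathsf{P}\{B(k) = 0 \mid \Phi_0^\infty\} = \gamma
$$

where the conditioning above is on the entire trajectory of the Markov chain $\Phi$. On denoting
$\tau_\bullet = \min\{k \ge 1 : B(k) = 1\}$ we have $\mathsf{P}\{\tau_\bullet > k \mid \Phi_0^\infty\} = \gamma^k$ for $k \ge 0$, and by independence

$$
\mathsf{E}\bigl[\mathbf{1}\{\tau_\bullet > k\}\,c(\Phi(k))\bigr] = \mathsf{E}\bigl[\mathbf{1}\{\tau_\bullet > k\}\bigr]\,\mathsf{E}\bigl[c(\Phi(k))\bigr] = \gamma^k\,\mathsf{E}\bigl[c(\Phi(k))\bigr]
$$

Summing each side over $k \ge 0$ removes the discounting from (10.25), converting it to a stochastic shortest
path problem:

$$
h(z) = \mathsf{E}\Bigl[\sum_{k=0}^{\tau_\bullet-1} c(\Phi(k)) \,\Big|\, \Phi(0) = z\Bigr] \tag{10.26}
$$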
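The equivalence between discounting and geometric killing in (10.26) is easy to test by simulation. The sketch below is an illustration only: the two-state chain `T`, the cost `c`, the sample size, and the seed are assumptions, and the discounted value is computed by a plain fixed-point iteration rather than any method from the text.

```python
# Sketch: discounted cost versus the undiscounted cost collected up to an
# independent geometric regeneration time. All data are illustrative assumptions.
import random

T = [[0.9, 0.1],
     [0.2, 0.8]]
c = [0.0, 1.0]
gamma = 0.9

def discounted_value(T, c, gamma, iters=2000):
    """Fixed-point iteration for h = c + gamma * T h."""
    h = [0.0] * len(c)
    for _ in range(iters):
        h = [c[z] + gamma * sum(T[z][zp] * h[zp] for zp in range(len(c)))
             for z in range(len(c))]
    return h

def killed_sum(z, rng):
    """Sum of c(Phi(k)) for 0 <= k <= tau - 1, with tau = min{k >= 1 : B(k) = 1}."""
    total = 0.0
    while True:
        total += c[z]                         # cost at time k, counted while tau > k
        if rng.random() >= gamma:             # B(k+1) = 1 with probability 1 - gamma
            return total
        z = 0 if rng.random() < T[z][0] else 1

rng = random.Random(0)
n_runs = 20000
estimate = sum(killed_sum(0, rng) for _ in range(n_runs)) / n_runs
exact = discounted_value(T, c, gamma)[0]
```

The two numbers agree up to Monte Carlo error. The killed sum has a random horizon with mean $1/(1-\gamma)$, which is the source of the variance concerns raised later in this section.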


To define $\Psi$ we let $N_1 = \tau_\bullet$ define the first regeneration time, and $\tilde\Phi(k) = \Phi(k)$ for $0 \le k \le \tau_\bullet - 1$.
The random variable $\tilde\Phi(\tau_\bullet)$ is sampled independently of the past, with distribution $\mu$. This
construction is repeated, with $\{N_n\}$ a renewal process, defined inductively by

$$
N_{n+1} = \min\{k \ge N_n + 1 : B(k) = 1\}, \qquad n \ge 1
$$

with $\tilde\Phi(k)$ defined as above on the interval $\{N_n \le k < N_{n+1}\}$ for each $n$.
The following is a variant of Kac’s Theorem for this Markov chain (see Prop. 6.10): 


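Proposition 10.12. For the Markov chain $\Psi$ with cost function $\check c(\Psi(k)) = c(\tilde\Phi(k))$, $k \ge 0$, the
partial sums $\{S_n\}$ in (10.24) are i.i.d. with common mean $\Gamma$ defined in (10.25). Consequently, the
average cost is given by

$$
\lim_{N\to\infty}\frac{1}{N}\sum_{k=0}^{N-1}\check c(\Psi(k)) = \lim_{M\to\infty}\frac{1}{N_M}\sum_{n=1}^{M-1}S_n = (1-\gamma)\,\Gamma
$$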


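Finite-horizon. Let $\mathcal{N} \ge 1$ and $z^\bullet \in \mathsf{Z}$ be given, and consider the finite-horizon criterion

$$
\Gamma = \mathsf{E}\Bigl[\sum_{k=0}^{\mathcal{N}\wedge N_\bullet}\gamma^k c(\Phi(k))\Bigr], \qquad \Phi(0)\sim\mu \tag{10.27}
$$

where $N_\bullet = \min\{k \ge 1 : \Phi(k) = z^\bullet\}$, $\mathcal{N}\wedge N_\bullet = \min(\mathcal{N}, N_\bullet)$, and $\gamma > 0$ is arbitrary. This is the
weighted shortest path problem when $\mathcal{N} = \infty$ and $\gamma = 1$.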

To construct a Markov chain $\Psi$ for which (10.27) is proportional to average cost requires a
different regeneration construction. We again define $\Psi$ as a pair process $\Psi(k) = (\tilde\Phi(k), \nu(k))$, where
in this setting $\nu$ is defined to be deterministic and periodic: $\nu(k) = k \pmod{\mathcal{N}}$. The regeneration
times are also deterministic: $N_n = n\mathcal{N}$ for $n \ge 1$, and the construction is defined so that the
sequence $\{\tilde\Phi(N_n) : n \ge 1\}$ is i.i.d. with marginal $\mu$.

We borrow ideas from the discounted cost setting to construct $\Psi$: the state space $\mathsf{Z}$ is enlarged
to include a graveyard state denoted $\blacktriangle$. Hence the state space for $\Psi$ is $\{\mathsf{Z}\cup\blacktriangle\}\times\{0,\dots,\mathcal{N}-1\}$.
The dynamics on each interval $N_n < k < N_{n+1}$ are defined as follows:

(i) $\mathsf{P}\{\tilde\Phi(k) = \blacktriangle \mid \tilde\Phi(k-1) = z^\bullet\} = \mathsf{P}\{\tilde\Phi(k) = \blacktriangle \mid \tilde\Phi(k-1) = \blacktriangle\} = 1$
(ii) $\mathsf{P}\{\tilde\Phi(k) = z' \mid \tilde\Phi(k-1) = z\} = T(z,z')$ for $z, z' \in \mathsf{Z}$, whenever $z \neq z^\bullet$.

As in the discounted setting, there is a simple interpretation that lends itself to simulation or
experimental design: for each $n$, initialize $\tilde\Phi(N_n)\sim\mu$, independent of $\{\tilde\Phi(k) : k < N_n\}$. Obtain
samples of the state process according to the natural dynamics for $N_n < k < N_{n+1}$, stopping the
experiment or simulation if $z^\bullet$ is reached during this interval.
The cost function is defined by $\check c(\blacktriangle,\nu) = 0$, and $\check c(z,\nu) = \gamma^\nu c(z)$ for $z \in \mathsf{Z}$ and $\nu \in \{0,\dots,\mathcal{N}-1\}$.
An analog of Prop. 10.12 is obtained through this construction:


Proposition 10.13. The partial sums $\{S_n\}$ in (10.24) are i.i.d. with common mean $\Gamma$, now
defined in (10.27). Consequently, the average cost is given by

$$
\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^{N}\mathsf{E}\bigl[\check c(\Psi(k))\bigr] = \frac{1}{\mathcal{N}}\,\Gamma
$$

$\square$

Prop. 10.13 also applies to the truncated discounted-cost criterion:

$$
\Gamma = \mathsf{E}\Bigl[\sum_{k=0}^{\mathcal{N}}\gamma^k c(\Phi(k))\Bigr], \qquad \Phi(0)\sim\mu
$$

This may be preferred to the infinite-horizon objective (10.25), since the use of deterministic
regeneration times will likely lead to lower variance.


10.4.2 Average cost algorithms 


Both regeneration and relative DP equations are used next to construct algorithms designed to
estimate the solution to Poisson's equation (9.56). Regeneration motivates the particular
representation $H^\bullet$ given in (10.17), which is the unique solution satisfying $H^\bullet(z^\bullet) = 0$.

We begin with an algorithm inspired by the regenerative algorithm (10.20) and Thm. 10.11:


Regenerative TD($\lambda$) algorithm for average cost (on-policy)
For initialization $\theta_0, \zeta_0 \in \mathbb{R}^d$, the sequence of estimates is defined recursively:

$$
\begin{aligned}
\theta_{n+1} &= \theta_n + \alpha_{n+1}\zeta_n\mathcal{D}_{n+1} \\
\mathcal{D}_{n+1} &= \bigl(-H^\theta(\Phi(n)) + \tilde c_n + \mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,H^\theta(\Phi(n+1))\bigr)\Big|_{\theta=\theta_n} \\
\zeta_{n+1} &= \lambda\mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,\zeta_n + \psi_{(n+1)} \\
\eta_{n+1} &= \eta_n + \tilde c_n/(n+1), \qquad \tilde c_n = c(\Phi(n)) - \eta_n, \qquad n \ge 0
\end{aligned}
\tag{10.28}
$$

This is a linear SA algorithm based on a slight modification of (10.21): $\gamma = 1$ in the definition
of $A_{n+1}$ and $b_{n+1} = -\zeta_n[c_n - \eta_n]$. The associated ODE is thus linear, with vector field

$$
\bar f(\theta) = A\theta - b, \qquad A = \mathsf{E}_\varpi[A_n], \quad b = -\mathsf{E}_\varpi[\zeta_n\tilde c(\Phi(n))]
$$


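Theorem 10.14. ($L_2$ Optimality of TD(1) for average cost) Consider the algorithm (10.28)
with linear function approximation $H^\theta = \theta^{\intercal}\psi$. Assume that $\Phi$ is uni-chain and that $\varpi(z^\bullet) > 0$.
Then, in the special case $\lambda = 1$,

(i) $A = -R(0)$
(ii) Any solution to $0 = \bar f(\theta^*) = \mathsf{E}_\varpi[\zeta_n\mathcal{D}_{n+1}]$ solves the minimum norm problem:

$$
\theta^* \in \arg\min_\theta \|H^\theta - H^\bullet\|^2_\varpi = \arg\min_\theta \mathsf{E}_\varpi\bigl[(H^\theta(\Phi(n)) - H^\bullet(\Phi(n)))^2\bigr] \tag{10.29}
$$

where $H^\bullet$ is the solution to Poisson's equation given by

$$
H^\bullet(z) = \mathsf{E}\Bigl[\sum_{k=0}^{\tau_\bullet-1}\tilde c(\Phi(k)) \,\Big|\, \Phi(0) = z\Bigr] \qquad \square
$$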


Once again, in most cases it is best to use the LSTD(1) formulation (10.23) if the function class
is linear.
It is anticipated that an algorithm derived from a relative DP equation will have lower variance.
Consider

$$
0 = \mathsf{E}\bigl[-H(\Phi(k)) - \delta\langle\mu, H\rangle + c(\Phi(k)) + H(\Phi(k+1)) \mid \Phi(k) = z\bigr], \qquad z \in \mathsf{X}\times\mathsf{U} \tag{10.30}
$$

The function $H$ is the unique solution to Poisson's equation for which $\delta\langle\mu, H\rangle = \eta$. Consequently,

$$
H(z) - H(z^\bullet) = H^\bullet(z), \qquad z \in \mathsf{X}\times\mathsf{U} \tag{10.31}
$$


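Given the foregoing we have a natural candidate for approximation, in which estimation of $\eta$ is
abandoned:


Regenerative relative TD($\lambda$) algorithm for average cost (on-policy)


For initialization $\theta_0, \zeta_0 \in \mathbb{R}^d$, the sequence of estimates is defined recursively:

$$
\begin{aligned}
\theta_{n+1} &= \theta_n + \alpha_{n+1}\zeta_n\mathcal{D}_{n+1} \\
\mathcal{D}_{n+1} &= \bigl(-H^\theta(\Phi(n)) - \delta\langle\mu, H^\theta\rangle + c_n + H^\theta(\Phi(n+1))\bigr)\Big|_{\theta=\theta_n} \\
\zeta_{n+1} &= \lambda\mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,\zeta_n + \psi_{(n+1)}
\end{aligned}
\tag{10.32}
$$

We now have $\theta_{n+1} = \theta_n + \alpha_{n+1}[A_{n+1}\theta_n - b_{n+1}]$ with

$$
A_{n+1} = \zeta_n\bigl[-\psi_{(n)} - \delta\psi^\mu + \psi_{(n+1)}\bigr]^{\intercal}, \qquad b_{n+1} = -\zeta_n c_n \tag{10.33}
$$

where $\psi^\mu_i = \langle\mu, \psi_i\rangle$ for each $i$. We have the following companion to Thm. 10.14: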


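Theorem 10.15. ($L_2$ Optimality of TD(1) for Average Cost) Consider the algorithm
(10.32) with linear function approximation $H^\theta = \theta^{\intercal}\psi$. Assume that $\Phi$ is uni-chain, that $\varpi(z^\bullet) > 0$,
and that for some parameter vector $\theta^\bullet \in \mathbb{R}^d$,

$$
\sum_i \theta^\bullet_i\psi_i(z) = \mathbf{1}\{z = z^\bullet\}, \qquad z \in \mathsf{X}\times\mathsf{U}
$$

Then, in the special case $\lambda = 1$, for any solution to $\bar f(\theta^*) = 0$,

(i) $\eta = \delta\langle\mu, H^{\theta^*}\rangle$.
(ii) The "projected Poisson's equation" holds: with $Y = \psi(\Phi(n))$,

$$
H^{\theta^*}(\Phi(n)) = \mathsf{E}\bigl[\tilde c(\Phi(n)) + H^{\theta^*}(\Phi(n+1)) \mid Y\bigr]
$$

(iii) Suppose in addition that $1$ is in the span of the basis: for some $\theta^1 \in \mathbb{R}^d$,

$$
\sum_i \theta^1_i\psi_i(z) = 1, \qquad z \in \mathsf{X}\times\mathsf{U}
$$

Then, $H^{\theta^*}(\Phi(n)) = \mathsf{E}\bigl[H^\bullet(\Phi(n)) + H^{\theta^*}(z^\bullet) \mid Y\bigr]$, and the minimum norm problem is solved in
the following span seminorm: with $r^* = H^{\theta^*}(z^\bullet)$,

$$
(\theta^*, r^*) \in \arg\min_{\theta, r}\,\mathsf{E}_\varpi\bigl[(H^\theta(\Phi(n)) - r - H^\bullet(\Phi(n)))^2\bigr] \tag{10.34}
$$

$\square$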


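Unfortunately we lose the elegant expression $A = -R(0)$ for this algorithm. A glance at the
proof reveals that instead,

$$
\begin{aligned}
A &= \mathsf{E}_\varpi\bigl[\zeta_n\bigl[-\psi_{(n)} - \delta\psi^\mu + \psi_{(n+1)}\bigr]^{\intercal}\bigr] \\
 &= -\delta\,\mathsf{E}_\varpi[\zeta_n]\{\psi^\mu\}^{\intercal} + \mathsf{E}_\varpi\bigl[\zeta_n\{-\psi_{(n)} + \psi_{(n+1)}\}^{\intercal}\bigr]
\end{aligned}
$$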


It can be shown that A is invertible for sufficiently small 6 > 0 provided ~'R(0)~!(z*) < 1. 
Moreover, 


BRO) *b(z") < Vb(z*)TR(0)-14(z") 
which might lead to choices for basis selection to ensure the right hand side is no greater than
one. Conditions to ensure that $A$ is Hurwitz are not yet available, so this approach may be best
implemented using stochastic Newton-Raphson, in the form of LSTD(1). The estimate at final
time $N$ would then be defined by (10.23) where the 'hats' represent sample path averages of $\{A_n, b_n\}$
defined in (10.33).

10.5 Gather the Actors


We met the actors briefly at the start of this chapter, defined to be a family of randomized policies
$\{\check\phi^\theta : \theta \in \mathbb{R}^d\}$. It is assumed henceforth that they are continuously differentiable in $\theta$. Examples
include the Gibbs policy (9.46a), and the linear family:

$$
\check\phi^\theta(u\mid x) = \sum_{i=1}^{d}\theta_i\,\check\phi^i(u\mid x) \tag{10.35}
$$

where $\{\check\phi^i : 1 \le i \le d\}$ is a pre-selected family of deterministic policies, and the parameter is
constrained to be non-negative and sum to unity: $\theta \in \mathbb{R}^d_+$ and $\sum_i\theta_i = 1$. The linear family
(10.35) can be regarded as a compression of the input space, replacing $\mathsf{U}$ with the set of $d$ indices
$I = \{1,\dots,d\}$, with $\theta_i$ interpreted as the probability of choosing index $i$.

Theory in Section 10.4.1 allows us to restrict exclusively to the average cost criterion throughout
the remaining sections of this chapter, since other optimality criteria can be converted to average
cost through the introduction of regeneration. This is convenient to simplify discussion, and also
because we can build on ideas from Section 6.8 concerning Sensitivity and Actor-Only Methods.


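10.5.1 Actor-Critic for average cost

To apply the sensitivity formula in Thm. 6.8 requires the representations

$$
\begin{aligned}
c_\theta(x) &= \sum_u \check\phi^\theta(u\mid x)\,c(x,u) \\
P_\theta(x,x') &= \sum_u \check\phi^\theta(u\mid x)\,P_u(x,x'), \qquad x, x' \in \mathsf{X},\ \theta \in \mathbb{R}^d
\end{aligned}
\tag{10.36}
$$

We also require notation for the pair process $\Phi = (X, U)$. Its transition matrix and invariant pmf
are again given by (9.3). In the current notation, these become

$$
T_\theta(z,z') \overset{\text{def}}{=} P_u(x,x')\,\check\phi^\theta(u'\mid x'), \qquad \varpi_\theta(z) = \pi_\theta(x)\,\check\phi^\theta(u\mid x) \tag{10.37}
$$

for any $z = (x,u)$ and $z' = (x',u')$. Our goal is to minimize average cost:


Actor-Critic Objective.

$$
\Gamma(\theta) = \sum_{x\in\mathsf{X}} c_\theta(x)\,\pi_\theta(x) = \sum_{z\in\mathsf{Z}} c(z)\,\varpi_\theta(z)
$$

It is assumed throughout that the invariant pmf $\pi_\theta$ is unique for each $\theta$.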


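In view of the definitions of $c_\theta$ and $P_\theta$, the sensitivity formula (6.52) in Thm. 6.8 requires partial
derivatives of $\check\phi^\theta$ with respect to $\theta$. The gradient of the logarithm of $\check\phi^\theta$ plays an essential role:

$$
\Lambda^\theta(x,u) = \nabla_\theta\log[\check\phi^\theta(u\mid x)] \tag{10.38}
$$

In particular, we have

$$
\begin{aligned}
\nabla_\theta c_\theta(x) &= \sum_u \check\phi^\theta(u\mid x)\,\Lambda^\theta(x,u)\,c(x,u) \\
\nabla_\theta P_\theta(x,x') &= \sum_u \check\phi^\theta(u\mid x)\,\Lambda^\theta(x,u)\,P_u(x,x'), \qquad x, x' \in \mathsf{X},\ \theta \in \mathbb{R}^d
\end{aligned}
\tag{10.39}
$$

Perhaps more fundamental is that $\Lambda^\theta$ is the score function for the transition matrix on $\mathsf{Z}$:

$$
\Lambda^\theta(z') = \nabla_\theta\log(T_\theta(z,z')), \qquad z, z' \in \mathsf{Z} \tag{10.40}
$$

This formula follows from the definition (10.37).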

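The reason that actor-critic methods appear here, right after TD(1)-learning for average cost,
is that the sensitivity formula in Thm. 6.8 can be expressed in terms of the fixed-policy Q-function.
Denote for any $\theta$,

$$
Q_\theta(x,u) = c(x,u) + P_u h_\theta\,(x) = c(x,u) + \sum_{x'\in\mathsf{X}} P_u(x,x')\,h_\theta(x'), \qquad x \in \mathsf{X},\ u \in \mathsf{U}, \tag{10.41}
$$

where $h_\theta$ solves Poisson's equation, $c_\theta + P_\theta h_\theta = h_\theta + \Gamma(\theta)$.

Theorem 10.16. Under the assumptions of this section, for each $\theta \in \mathbb{R}^d$,

$$
\nabla\Gamma(\theta) = \mathsf{E}_{\varpi_\theta}\bigl[\Lambda^\theta(\Phi(k))\,Q_\theta(\Phi(k))\bigr] \tag{10.42}
$$

Proof. The function $Q_\theta$ solves Poisson's equation for $\Phi$, with cost function $c\colon \mathsf{Z}\to\mathbb{R}$:

$$
\begin{aligned}
\mathsf{E}[Q_\theta(\Phi(k+1)) \mid \Phi(k) = z] &= \sum_{x',u'} P_u(x,x')\,\check\phi^\theta(u'\mid x')\{c(x',u') + P_{u'}h_\theta\,(x')\} \\
 &= \sum_{x'} P_u(x,x')\{c_\theta(x') + P_\theta h_\theta\,(x')\} \\
 &= \sum_{x'} P_u(x,x')\{h_\theta(x') + \Gamma(\theta)\} = Q_\theta(x,u) - c(x,u) + \Gamma(\theta)
\end{aligned}
$$

Written in matrix notation, this is $T_\theta Q_\theta = Q_\theta - c + \Gamma(\theta)$.
This combined with (10.40) and Thm. 6.8 completes the proof. $\square$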


The theorem invites many questions:


(i) How can this be used for optimization? Stochastic approximation is an option:

$$
\theta_{n+1} = \theta_n - \alpha_{n+1}\breve\nabla_{\!\Gamma}(n), \qquad \breve\nabla_{\!\Gamma}(n) = \Lambda^{\theta_n}(\Phi(n))\,Q_{\theta_n}(\Phi(n))
$$

This is a version of stochastic gradient descent (SGD). The function $\Lambda^\theta$ is known, since we
have constructed the policy. The Q-function is not known, and a poor estimate would mean
poor approximation of $\theta^*$.


(ii) Even if $Q_\theta$ were known, the stochastic approximation algorithm can be expected to have
large variance. How can the variance be tamed?

The questions are addressed one-by-one, and justify new conclusions and algorithms. Each
algorithm requires two function classes: one to define the family of randomized policies $\{\check\phi^\theta : \theta \in \mathbb{R}^d\}$
and a second function class $\{\mathcal{H}^\theta : \theta \in \mathbb{R}^d\}$ to define the approximations for the Q-functions.
The following assumptions are imposed so the $L_2$ theory from the previous section will be available:


Actor-Critic Basis. A linear parameterization for $\{\mathcal{H}^\theta : \theta \in \mathbb{R}^d\}$ is assumed, with fixed
dimension $d'$. The basis functions may depend upon $\theta$, so that a generic function in $\mathcal{H}^\theta$ can
be expressed

$$
H^\theta_\omega = \omega^{\intercal}\psi_\theta, \qquad \omega \in \mathbb{R}^{d'},\ \theta \in \mathbb{R}^d \tag{10.43}
$$

It is assumed that $\psi_\theta$ is continuously differentiable and Lipschitz continuous in $\theta$.
We will see shortly that $d' \ge d$ is usually desirable.


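Actor-Critic Algorithm


For initialization $\theta_0 \in \mathbb{R}^d$ and $\omega_0, \zeta_0 \in \mathbb{R}^{d'}$,

$$
\theta_{n+1} = \theta_n - \alpha_{n+1}\breve\nabla_{\!\Gamma}(n), \qquad \breve\nabla_{\!\Gamma}(n) = \Lambda^{\theta_n}(\Phi(n))\,H^{\theta_n}_{\omega_n}(\Phi(n)) \tag{10.44a}
$$
$$
\Phi(n+1) \sim T_{\theta_n}(z,\,\cdot\,), \quad \text{with } z = \Phi(n) \tag{10.44b}
$$
$$
\begin{aligned}
\mathcal{D}_{n+1} &= \bigl\{-H^\theta_\omega(\Phi(n)) + \tilde c_n + \mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,H^\theta_\omega(\Phi(n+1))\bigr\}\Big|_{\theta=\theta_n,\ \omega=\omega_n} \\
\omega_{n+1} &= \omega_n + \beta_{n+1}\zeta_n\mathcal{D}_{n+1} \\
\zeta_{n+1} &= \lambda\mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,\zeta_n + \psi_{\theta_{n+1}}(\Phi(n+1)) \\
\eta_{n+1} &= \eta_n + \beta_{n+1}\tilde c_n, \qquad \tilde c_n = c(\Phi(n)) - \eta_n
\end{aligned}
\tag{10.44c}
$$

The set of equations in (10.44c) is based on the TD($\lambda$) algorithm (10.28), specialized to TD(1) on setting $\lambda = 1$.
This version of TD(1) is favored because there is a firmer stability theory as compared to (10.32).
The algorithm has two different step-sizes, satisfying the standard assumptions:

$$
\sum_{n=1}^{\infty}\alpha_n = \sum_{n=1}^{\infty}\beta_n = \infty, \qquad \sum_{n=1}^{\infty}\{\alpha_n^2 + \beta_n^2\} < \infty
$$

It is assumed that the latter is much larger than the former, so that it is possible for $H^{\theta_n}_{\omega_n}$ to track
the estimate of the fixed policy Q-function $Q_{\theta_n}$ (associated with the policy $\check\phi^{\theta_n}$).
The following would be expected from the theory of two time-scale SA: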


Proposition 10.17. Suppose that $\lambda = 1$ and the step-size sequences satisfy assumption (8.22).
Assume moreover that the parameter estimates are bounded, and the following consistency condition
holds:

$$
\text{for each } \theta \in \mathbb{R}^d \text{ there is a } \omega^*_\theta \in \mathbb{R}^{d'} \text{ satisfying } H^\theta_{\omega^*_\theta} = Q_\theta. \tag{10.45}
$$

Then, the ODE approximation of (10.44) is gradient descent $\tfrac{d}{dt}\vartheta = -\nabla\Gamma(\vartheta)$. $\square$


The consistency assumption (10.45) is unrealistic outside of the tabular setting. Removing this 
assumption is possible through an application of Thm. 10.14—details are provided in Section 10.6. 


10.5.2 A few warnings and remedies 


It is time to say a few words about the focus on randomization here, which wasn't required in
Sections 4.6 and 4.7. Randomization is required if we want to apply sensitivity theory for Markov
chains, which brings in the score functions $S^\theta$ and $\Lambda^\theta$. In particular, there is no meaningful
definition of $\Lambda^\theta$ for deterministic threshold policies, such as the policy proposed for Mountain Car
in Section 4.7.1.$^{10}$ In most applications it is reasonable to abandon randomization in the final step
of policy design: once we have our parameter estimate $\theta \approx \theta^*$, construct a deterministic policy:


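$$
\phi^{\text{det}}(x) = \arg\max_u \check\phi^\theta(u\mid x), \qquad x \in \mathsf{X}.
$$

We might expect that $\check\phi^\theta$ will be nearly deterministic if it is nearly optimal, in which case $\phi^{\text{det}}$ is
a small perturbation of $\check\phi^\theta$. The example that follows illustrates this point.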


Example 10.5.1. The best parameter is probably $\infty$


Consider an entirely ideal setting in which $d = 1$ and $\theta$ plays the role of "inverse temperature" in
the Gibbs policy (9.46a):

$$
\check\phi^\theta(u\mid x) = \frac{1}{\kappa(x,\theta)}\exp(-\theta H(x,u))
$$

where $H\colon \mathsf{X}\times\mathsf{U}\to\mathbb{R}$, and $\kappa(x,\theta)$ is a normalizing constant. Suppose that we are so fortunate that
the optimal policy is obtained from $H$:

$$
\phi^*(x) = \arg\min_u H(x,u), \qquad x \in \mathsf{X}
$$

and that $\phi^*$ is unique (no other policy is optimal). Unfortunate conclusions follow:

(i) $\check\phi^\theta$ is not an optimal policy for any $\theta$, and (ii) $\lim_{\theta\to\infty}\check\phi^\theta = \phi^*$.
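The two conclusions of Example 10.5.1 are easy to see numerically at a single state. The sketch below uses a hypothetical row of values $H(x,\cdot)$ and a helper named `gibbs`; both are assumptions for illustration only.

```python
# Sketch: a Gibbs policy with inverse temperature theta approaches the greedy
# policy arg min_u H(x, u) as theta grows. H_row is an illustrative assumption.
import math

def gibbs(H_row, theta):
    """pmf u -> exp(-theta * H(x, u)) / kappa(x, theta), for one fixed state x."""
    m = min(H_row)                                    # shift for numerical stability
    weights = [math.exp(-theta * (h - m)) for h in H_row]
    kappa = sum(weights)
    return [wt / kappa for wt in weights]

H_row = [1.0, 0.2, 0.7]                               # the minimizer is u = 1
probs = {theta: gibbs(H_row, theta) for theta in (1.0, 10.0, 100.0)}
```

For every finite $\theta$ each input keeps positive probability in exact arithmetic, so the policy is never deterministic, while the mass on the minimizing input increases toward one. In floating point the mass rounds to exactly one for large $\theta$, which is one practical reason to prefer regularization over letting $\theta_n$ diverge.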


This example might seem contrived, but Gibbs policies will be featured in the coming sections,
and we will see in Section 10.6 that there is good reason to include an approximation to the
Q-function in a Gibbs policy.

The example suggests that a good algorithm must allow for $\theta_n$ to converge to $\infty$. This may not
be practical, so instead we introduce regularization: choose a convex regularizer $\mathcal{R}\colon \mathbb{R}^d\to\mathbb{R}_+$, and
modify the actor-critic algorithm to approximate regularized gradient descent:

$$
\tfrac{d}{dt}\vartheta = -\nabla\Gamma(\vartheta) - \nabla\mathcal{R}(\vartheta) \tag{10.46}
$$

so that (10.44a) is replaced with

$$
\theta_{n+1} = \theta_n - \alpha_{n+1}\{\breve\nabla_{\!\Gamma}(n) + \nabla\mathcal{R}_n(\theta_n)\}
$$

where $\mathcal{R}_n\colon \mathbb{R}^d\to\mathbb{R}_+$ is potentially random, with $\mathsf{E}[\nabla\mathcal{R}_n(\theta)] \approx \nabla\mathcal{R}(\theta)$ for each $\theta$ and all large $n$.

$^{10}$ See [323] for alternative formulations of actor-critic methods for deterministic policies.


10.6 SGD Without Bias 


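The ODE approximation in Prop. 10.17 can be obtained under assumptions far weaker than (10.45).
A function class $\{\mathcal{H}^\theta : \theta \in \mathbb{R}^d\}$ of candidate approximations to the Q-functions is said to
satisfy the compatible features property (CFP) if

$$
\Lambda^\theta_i \in \mathcal{H}^\theta \quad \text{for each } \theta \in \mathbb{R}^d \text{ and } 1 \le i \le d \tag{10.47}
$$

Proposition 10.18. Suppose that $\{\mathcal{H}^\theta : \theta \in \mathbb{R}^d\}$ satisfies the CFP (10.47). For given $\theta \in \mathbb{R}^d$,
let $\hat Q_\theta$ denote a solution to the minimum norm problem

$$
\hat Q_\theta \in \arg\min\{\|H - Q_\theta\|^2_{\varpi_\theta} : H \in \mathcal{H}^\theta\}
$$

Then,

$$
\nabla\Gamma(\theta) = \mathsf{E}_{\varpi_\theta}\bigl[\Lambda^\theta(\Phi(k))\,Q_\theta(\Phi(k))\bigr] = \mathsf{E}_{\varpi_\theta}\bigl[\Lambda^\theta(\Phi(k))\,\hat Q_\theta(\Phi(k))\bigr] \tag{10.48}
$$

Proof. $L_2$ optimality is equivalent to the orthogonality property:

$$
0 = \mathsf{E}_{\varpi_\theta}\bigl[\{Q_\theta(\Phi(k)) - \hat Q_\theta(\Phi(k))\}\,H(\Phi(k))\bigr], \qquad \text{for all } H \in \mathcal{H}^\theta
$$

The identity (10.48) is obtained on setting $H = \Lambda^\theta_i$ for each $i$. $\square$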


The practical importance of the proposition is seen on revisiting Thm. 10.14: for fixed $\theta$ and
with $U(k) \sim \check\phi^\theta(\,\cdot\mid X(k))$ for all $k$, the identity (10.48) will hold for the approximation obtained using
TD(1), whenever the linear function class satisfies the CFP.

This assumption is not restrictive. In practice we might start with a function class $\mathcal{H}^0$, and
then for each $\theta$ define

$$
\mathcal{H}^\theta = \Bigl\{h = h^0 + \sum_{i=1}^{d}\omega_i\Lambda^\theta_i : h^0 \in \mathcal{H}^0,\ \omega \in \mathbb{R}^d\Bigr\} \tag{10.49}
$$

While Thm. 10.14 as stated is only valid when $\theta$ is independent of $n$, the theory of two time-scale
SA gives us the following extension of Prop. 10.17:


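Proposition 10.19. Suppose that the assumptions of Prop. 10.17 hold, but with (10.45) replaced
with the compatible features assumption (10.47).
Then, the ODE approximation of (10.44) is unchanged: $\tfrac{d}{dt}\vartheta = -\nabla\Gamma(\vartheta)$. $\square$

We can disregard any part of $\Lambda^\theta$ that does not depend upon $u$. Let $\mathcal{F}^-_k$ denote the partial
history up to time $k$:

$$
\mathcal{F}^-_k = \{X(k), \Phi(i) : 0 \le i \le k-1\} \tag{10.50}
$$

This is the entire history up to $k$, except that $U(k)$ is disregarded.


Lemma 10.20. For any initial distribution for $X(0)$ we have $\mathsf{E}[\Lambda^\theta(\Phi(k)) \mid \mathcal{F}^-_k] = 0$.
Consequently,


(i) $\{\Lambda^\theta(\Phi(k)) : k \ge 0\}$ is a martingale difference sequence.
(ii) For any function $g\colon \mathsf{X}\to\mathbb{R}$,

$$
0 = \mathsf{E}[g(X(k))\,\Lambda^\theta(\Phi(k))]
$$

Proof. From the definitions we have for any $k$ and $x \in \mathsf{X}$,

$$
\mathsf{E}[\Lambda^\theta(\Phi(k)) \mid \mathcal{F}^-_k;\ X(k) = x] = \sum_u \Lambda^\theta(x,u)\,\check\phi^\theta(u\mid x)
$$

Given the definition $\Lambda^\theta(x,u) = \nabla_\theta\check\phi^\theta(u\mid x)/\check\phi^\theta(u\mid x)$, it follows that

$$
\mathsf{E}[\Lambda^\theta(\Phi(k)) \mid \mathcal{F}^-_k;\ X(k) = x] = \sum_u \nabla_\theta\check\phi^\theta(u\mid x) = \nabla_\theta\sum_u \check\phi^\theta(u\mid x) = 0
$$

where the final equality holds because $\check\phi^\theta(\,\cdot\mid x)$ is a pmf on $\mathsf{U}$ for each $x$.
The smoothing property of conditional expectation gives

$$
\mathsf{E}[g(X(k))\,\Lambda^\theta(\Phi(k))] = \mathsf{E}\bigl[g(X(k))\,\mathsf{E}[\Lambda^\theta(X(k),U(k)) \mid \mathcal{F}^-_k]\bigr] = 0 \qquad \square
$$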


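Lemma 10.20 implies that we can relax the definition of compatible features to read: for each
$\theta \in \mathbb{R}^d$ and $1 \le i \le d$, there is a function $G^\theta_i\colon \mathsf{X}\to\mathbb{R}$ such that

$$
\Lambda^\theta_i - G^\theta_i \in \mathcal{H}^\theta \tag{10.51}
$$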


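Gibbs policy. Given a $d$-dimensional basis vector $\psi^0$, consider for each $\theta$ the policy

$$
\check\phi^\theta(u\mid x) = \frac{1}{\kappa(\theta,x)}\exp\bigl(\theta^{\intercal}\psi^0(x,u)\bigr) \tag{10.52}
$$

where $\kappa$ is a normalizing constant (recall (9.46)). We then have from the definitions,


Lemma 10.21. For the Gibbs policy,

$$
\Lambda^\theta(x,u) = \psi_\theta(x,u) \overset{\text{def}}{=} \psi^0(x,u) - \underline{\psi}^0_\theta(x), \qquad
\underline{\psi}^0_\theta(x) = \sum_v \check\phi^\theta(v\mid x)\,\psi^0(x,v) \qquad \square
$$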
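Both Lemma 10.20 and Lemma 10.21 can be sanity-checked at a single state. In the sketch below, the basis values `psi0`, the parameter `theta`, and the helper names are illustrative assumptions; the finite-difference comparison is only a numerical check, not part of the theory.

```python
# Sketch: score function of a Gibbs policy at one fixed state x.
# psi0[u] is the basis vector psi^0(x, u); all numbers are illustrative assumptions.
import math

def gibbs(theta, psi0):
    """pmf over inputs, proportional to exp(theta^T psi0(x, u))."""
    logits = [sum(t * p for t, p in zip(theta, pu)) for pu in psi0]
    m = max(logits)
    weights = [math.exp(lg - m) for lg in logits]
    kappa = sum(weights)
    return [wt / kappa for wt in weights]

def score(theta, psi0):
    """Lambda(x, u) = psi0(x, u) - sum_v policy(v | x) psi0(x, v), for each u."""
    pol = gibbs(theta, psi0)
    d = len(theta)
    mean = [sum(pol[v] * psi0[v][i] for v in range(len(psi0))) for i in range(d)]
    return [[pu[i] - mean[i] for i in range(d)] for pu in psi0]

psi0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # three inputs, d = 2
theta = [0.3, -0.5]
pol = gibbs(theta, psi0)
Lam = score(theta, psi0)

# Lemma 10.20: zero conditional mean of the score, given the state.
cond_mean = [sum(pol[u] * Lam[u][i] for u in range(3)) for i in range(2)]

# Lemma 10.21 versus a central finite difference of log policy, for input u = 0.
eps = 1e-6
fd = []
for i in range(2):
    tp = list(theta)
    tm = list(theta)
    tp[i] += eps
    tm[i] -= eps
    fd.append((math.log(gibbs(tp, psi0)[0]) - math.log(gibbs(tm, psi0)[0])) / (2 * eps))
```

The vector `cond_mean` vanishes up to rounding, and `fd` agrees with `Lam[0]` to the accuracy of the finite difference.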


Consequently, for this policy we can take $d' = d$ and $\mathcal{H}^\theta = \{\omega^{\intercal}\psi^0 : \omega \in \mathbb{R}^d\}$ to ensure the CFP
is satisfied in the relaxed form (10.51). However, in the next section we see that it may be best
to use $\mathcal{H}^\theta = \{\omega^{\intercal}\psi_\theta : \omega \in \mathbb{R}^d\}$.

10.7 Advantage and Control Variates


First, and most important: don't be fooled by the beauty of unbiased gradient observations. The
ideal algorithm using TD(1) may come with massive variance. In practice you are likely to resort
to the introduction of $\lambda < 1$ in TD($\lambda$) along with the application of state weighting, so that the
eligibility vector in (10.44c) is replaced with

$$
\zeta_{n+1} = \lambda\mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,\zeta_n + w_{n+1}\psi_{\theta_{n+1}}(\Phi(n+1))
$$

where $w_{n+1} = w(X(n+1))$ for the weighting function $w\colon \mathsf{X}\to\mathbb{R}_+$.

You may go further, and experiment with the discounted-cost value function as an approxima- 
tion for Qg. This is unfortunate, but bias/variance tradeoffs are a theme in machine learning that 
appear to be inescapable. 


The following pages describe techniques to reduce variance without bias. 


10.7.1 Variance reduction through advantage 


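Lemma 10.20 tells us that we can construct a second family of functions $\mathcal{G}$, where $G\colon \mathsf{X}\to\mathbb{R}$ for
each $G \in \mathcal{G}$, and replace (10.44a) with the following:

$$
\theta_{n+1} = \theta_n - \alpha_{n+1}\Lambda^{\theta_n}(\Phi(n))\{H^{\theta_n}_{\omega_n}(\Phi(n)) - G_n(X(n))\} \tag{10.44a'}
$$

Provided that the CFP holds, we will maintain the ODE approximation as gradient descent,
regardless of how $\{G_n\}$ are defined (subject to continuous dependency on parameter estimates). The
favored choice of $G$ is the same as identified in Section 9.1.3:

$$
G_\theta(x) = \mathsf{E}[H_\theta(\Phi(n)) \mid X(n) = x] = \sum_u H_\theta(x,u)\,\check\phi^\theta(u\mid x)
$$

The difference $Q_\theta - h_\theta$ was designated the advantage function associated with policy $\check\phi^\theta$.
A modified function class $\mathcal{H}^\theta_\circ$ is defined such that any $V_\theta \in \mathcal{H}^\theta_\circ$ can be expressed

$$
V_\theta(x,u) = H_\theta(x,u) - \underline{H}_\theta(x), \quad \text{for some } H_\theta \in \mathcal{H}^\theta, \text{ with } \underline{H}_\theta(x) = \sum_u H_\theta(x,u)\,\check\phi^\theta(u\mid x)
$$

And for any $\omega$ we write $V^\theta_\omega(x,u) \overset{\text{def}}{=} H^\theta_\omega(x,u) - \underline{H}^\theta_\omega(x) = \omega^{\intercal}\psi^\circ_\theta(x,u)$, with

$$
\psi^\circ_\theta(x,u) = \psi_\theta(x,u) - \underline{\psi}_\theta(x), \qquad \underline{\psi}_\theta(x) = \sum_u \psi_\theta(x,u)\,\check\phi^\theta(u\mid x)
$$

We arrive at an update equation that is preferred in recent research:

$$
\theta_{n+1} = \theta_n - \alpha_{n+1}\breve\nabla_{\!\Gamma}(n), \qquad \breve\nabla_{\!\Gamma}(n) \overset{\text{def}}{=} \Lambda^{\theta_n}(\Phi(n))\,V^{\theta_n}_{\omega_n}(\Phi(n)) \tag{10.44a*}
$$

This form comes for free when using $\psi_\theta = \Lambda^\theta$, since in this case $\underline{\psi}_\theta = 0$ is obtained as an application
of Lemma 10.20.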

The introduction of $\{\Lambda^{\theta_n}(\Phi(n))\,\underline{H}^{\theta_n}_{\omega_n}(X(n)) : n \ge 0\}$ to obtain (10.44a*) from (10.44a) is an
example of the control variate technique that was introduced in Section 6.7.5.

10.7.2 A better advantage


Consider a function class satisfying the CFP. Suppose that the input space is not large, so that
it is practical to compute $\underline{H}^\theta_\omega(x)$ based on $H^\theta_\omega(x,u)$ for each observed $x \in \mathsf{X}$. In this case there
is an alternative control variate available, defined by a different sort of smoothing. It is no more
complex, and has lower variance.

For any $\theta$, $\omega$ and $x$ denote

$$
\overline{\Lambda H}{}^\theta_\omega(x) = \mathsf{E}\bigl[\Lambda^\theta(\Phi(n))\,H^\theta_\omega(\Phi(n)) \mid \mathcal{F}^-_n;\ X(n) = x\bigr]
$$

with $\{\mathcal{F}^-_n\}$ defined in (10.50). Applying the definition of the conditional mean using $\check\phi^\theta$ gives

$$
\overline{\Lambda H}{}^\theta_\omega(x) = \sum_u \nabla_\theta\check\phi^\theta(u\mid x)\,H^\theta_\omega(x,u), \qquad x \in \mathsf{X}
$$
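The smoothed quantity defined above can be compared with its sampled counterparts at a single state. The sketch below assumes a scalar parameter, a Gibbs policy, and hand-picked critic values `H`; none of these come from the text.

```python
# Sketch: the smoothed gradient term of this subsection at one state x.
# Assumptions: scalar theta, Gibbs policy, illustrative critic values H(x, u).
import math

psi0 = [0.0, 1.0, 2.0]           # psi^0(x, u) for three inputs
H = [1.0, -2.0, 0.5]             # critic values H(x, u)
theta = 0.4

weights = [math.exp(theta * p) for p in psi0]
kappa = sum(weights)
pol = [wt / kappa for wt in weights]                         # Gibbs policy at x
mean_psi = sum(p * q for p, q in zip(pol, psi0))
Lam = [p - mean_psi for p in psi0]                           # score (Lemma 10.21)
H_bar = sum(p * h for p, h in zip(pol, H))
V = [h - H_bar for h in H]                                   # advantage-style term

grad_pol = [p * s for p, s in zip(pol, Lam)]                 # grad policy = policy * score
smoothed = sum(g * h for g, h in zip(grad_pol, H))           # sum_u grad policy(u|x) H(x,u)

# Conditional means of the two sampled alternatives, given X(n) = x.
mean_LH = sum(p * s * h for p, s, h in zip(pol, Lam, H))
mean_LV = sum(p * s * v for p, s, v in zip(pol, Lam, V))

# Conditional variance of the sampled advantage term; the smoothed term has none.
var_LV = sum(p * (s * v - mean_LV) ** 2 for p, s, v in zip(pol, Lam, V))
```

All three means coincide, so smoothing introduces no bias. The sampled term $\Lambda V$ still fluctuates with the random input, while the smoothed term is a deterministic function of the state.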


From the smoothing property of conditional expectation we obtain a new unbiased SGD algorithm 
by “smoothing” (10.44a): 


def 


Grit =On—Onrive(n), — Vpln) # AH®*(X(n)) (10.44a**) 
We will see that Vp(n) has lower variance when compared to Vp (n) in (10.44a*). 

More important is a comparison of the respective asymptotic covariance matrices ia, as defined 
in (8.27). The first step in this comparison is to consider the gradient estimates evaluated at the 
optimal parameters. Since VT (6*) = 0 for any limit 0* of either algorithm, the two processes below 
have zero mean for the process ® in steady-state: 


An (n) = Vp (n) = A® (®(n))Vo2" (B(n)) 
Ag? (n) = Vp (n) = AHF (X(n)) 


The respective asymptotic covariance matrices are denoted 


ERO = So Lal (ny{Ve™(O)}] and ERP = So Eel Vp (n) {Ve (0) }] 


n=— CoO n=—CcoO 


Proposition 10.22. We have for any n, 


YV,0O v$,00 


Ve (n) = Vp (n) + AR 
where {AA} is a martingale difference sequence, satisfying E[AA | F~] = 0. Consequently, 
(i) The covariances are ordered: 


Cov(V;" (n)) = Cov(Vp™ (n)) + Cov(AA) > Cov(Ve"™ (n)) 


(ii) The asymptotic covariances are also ordered: SK? = SX + Cov(Af). 
Part (i) is not helpful for understanding variance of the respective actor-critic algorithms: the
ordinary covariance is not of primary interest in convergence theory for stochastic approximation.


Part (ii) tells us that the alternative control variate approach is preferable whenever $\mathrm{Cov}(\Delta_n)$ is
large.
Proof of Prop. 10.22. We have by definition
Aj SA (B(n))Vg2 (B(n)) — AH§" (X(n)) 
The conditional expectation is zero: 


E[A2 | Fn] = EIA® (®(n)) Vor" (®(n)) | Fp] — ALG" (X(n)) 
= E[A® (®(n)) Hg (®(n)) | Fn] — AHG (X(n)) 


where the second equality follows from Lemma 10.20. The conclusion E[A4 | F~] = 0 implies 
that {A4} is a martingale difference sequence. Part (i) follows because V- (n) = AH w"(X(n)) is 
measurable with respect to F,, so that A4 and VY; (n) are uncorrelated. 

For (ii), observe that the martingale difference property implies that AA and VY," (0) are 
uncorrelated for n > 1, giving 


EalVe (n){Vp (0)}7] = EolVe (mn) {Vp (0) }] 


Taking transposes of each side, we obtain the same identity for n < —1. Hence the two auto- 
covariance sequences are identical for n 4 0. O 


10.8 Natural Gradient and Zap 


What about the Newton-Raphson flow? If we had direct observations of the gradient this would
become

$$
\tfrac{d}{dt}\vartheta = -G(\vartheta)\,\nabla\Gamma(\vartheta), \tag{10.53}
$$

with $G = [\nabla^2\Gamma]^{-1}$. Thm. 10.16 might provide a means to obtain unbiased estimates, but a tractable
Zap-SA algorithm is not yet available. Moreover, this approach could compute a local maximum
rather than local minimum of $\Gamma$.

There is an alternative choice of matrix gain that is popular, and defines the natural gradient 
algorithm. 

To set up notation, let's first review theory surrounding approximation of the critic $Q_\theta$. For
the fixed policy setting with linear function approximation, Thm. 10.14 tells us that the optimal
matrix gain is given by $-A^{-1} = R(0)^{-1}$, with $R(0)$ the auto-correlation matrix for the basis.

Consider the minimal function class with compatible features using $\psi_\theta = \Lambda^\theta(x,u)$. We have seen
that this has mean zero for each $\theta$, so that the auto-correlation coincides with the auto-covariance,
and the notation $R(0)$ is abandoned in favor of

$$
F(\theta) = \sum_{x,u}\Lambda^\theta(x,u)\,\Lambda^\theta(x,u)^{\intercal}\,\varpi_\theta(x,u) \tag{10.54}
$$

This is known as the Fisher information matrix because of its association with a matrix of this
name appearing in statistics (see the close of Section 10.10 for further discussion).
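To make (10.54) concrete, the sketch below computes the Fisher information matrix and applies its inverse to a gradient in the simplest possible setting. The assumptions are strong and are not from the text: a single state (so that $\varpi_\theta$ reduces to the policy itself), a Gibbs policy with $d = 2$, and costs `cost[u]` standing in for the Q-function.

```python
# Sketch: Fisher information matrix (10.54) and a matrix-gain direction F^{-1} grad.
# Assumptions: single state, Gibbs policy with d = 2, illustrative costs.
import math

psi0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # psi^0(u) for three inputs
cost = [1.0, 3.0, 2.0]
theta = [0.2, -0.1]

logits = [theta[0] * p[0] + theta[1] * p[1] for p in psi0]
weights = [math.exp(lg) for lg in logits]
pol = [wt / sum(weights) for wt in weights]
mean = [sum(pol[u] * psi0[u][i] for u in range(3)) for i in range(2)]
Lam = [[psi0[u][i] - mean[i] for i in range(2)] for u in range(3)]      # score

# F = sum_u Lam(u) Lam(u)^T pol(u)
F = [[sum(pol[u] * Lam[u][i] * Lam[u][j] for u in range(3)) for j in range(2)]
     for i in range(2)]
# Gradient of the average cost in this setting: sum_u pol(u) Lam(u) cost(u)
grad = [sum(pol[u] * Lam[u][i] * cost[u] for u in range(3)) for i in range(2)]

det = F[0][0] * F[1][1] - F[0][1] * F[1][0]
F_inv = [[F[1][1] / det, -F[0][1] / det],
         [-F[1][0] / det, F[0][0] / det]]
nat_grad = [F_inv[i][0] * grad[0] + F_inv[i][1] * grad[1] for i in range(2)]
```

In this toy case the direction $F^{-1}\nabla\Gamma$ works out to the cost differences $(c(0) - c(2),\, c(1) - c(2)) = (-1, 1)$ for every $\theta$, whereas the ordinary gradient shrinks as the policy becomes nearly deterministic.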

For fixed-policy TD learning with linear function approximation, the asymptotic covariance can
be optimized using stochastic Newton-Raphson, and the same is true here for this two time-scale
algorithm. The matrix gain $G(\theta) = F^{-1}(\theta)$ is also used in the definition of the natural gradient
algorithm, whose ODE approximation is (10.53) with this choice of $G$.
Look over the next algorithm carefully: the matrix gain $G_n$ that approximates $F^{-1}(\theta_n)$ appears
twice. Its first appearance is used to approximate the natural gradient, and it appears again in the
update for $\omega_n$ (a version of stochastic Newton-Raphson).

The variance might be reduced using the control variate technique described in Section 10.7.2, 
replacing V;(n) in (10.55a) by Vp(n). 


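Natural Actor-Critic Algorithm with Zap


The function class is defined using $\psi_\theta = \Lambda^\theta$.
For initialization $R_0 > 0$ ($d\times d$), $\eta_0 \in \mathbb{R}$ and $\theta_0, \omega_0, \zeta_0 \in \mathbb{R}^d$,

$$
\theta_{n+1} = \theta_n - \alpha_{n+1}G_n\breve\nabla_{\!\Gamma}(n), \qquad \breve\nabla_{\!\Gamma}(n) = \Lambda_n H^{\theta_n}_{\omega_n}(\Phi(n)), \qquad G_n = R_n^{-1} \tag{10.55a}
$$
$$
\Lambda_n = \Lambda^{\theta_n}(\Phi(n)) \tag{10.55b}
$$
$$
\Phi(n+1) \sim T_{\theta_n}(z,\,\cdot\,), \quad \text{with } z = \Phi(n) \tag{10.55c}
$$
$$
\begin{aligned}
\mathcal{D}_{n+1} &= -H^{\theta_n}_{\omega_n}(\Phi(n)) + \tilde c_n + \mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,H^{\theta_n}_{\omega_n}(\Phi(n+1)) \\
\omega_{n+1} &= \omega_n + \beta_{n+1}G_n\zeta_n\mathcal{D}_{n+1} \\
\zeta_{n+1} &= \lambda\mathbf{1}\{\Phi(n+1)\neq z^\bullet\}\,\zeta_n + \psi_{\theta_{n+1}}(\Phi(n+1)) \\
\eta_{n+1} &= \eta_n + \beta_{n+1}\tilde c_{n+1}, \qquad \tilde c_{n+1} = c(\Phi(n+1)) - \eta_n \\
R_{n+1} &= R_n + \beta_{n+1}\bigl[\Lambda_{n+1}\Lambda_{n+1}^{\intercal} - R_n\bigr]
\end{aligned}
\tag{10.55d}
$$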


The inverse that defines the gain matrix $G_n = R_n^{-1}$ can be computed efficiently using the Matrix
Inversion Lemma (A.1), or an algorithm can be obtained without matrix inversion by adapting the
first-order Zap algorithm (8.52).
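
The recursion for $R_n$ in (10.55d) is a rank-one update, so the Matrix Inversion Lemma yields a recursion for $G_n = R_n^{-1}$ that avoids any explicit inversion. The following is a minimal sketch, assuming NumPy; the function name `update_gain`, the step-size choice $\beta_n = 1/(n+1)$, and the random vectors standing in for $\Lambda_{n+1}$ are hypothetical and serve only to illustrate the identity.

```python
import numpy as np

def update_gain(G, lam, beta):
    """Return the inverse of R + beta*(lam lam^T - R), given G = R^{-1}.

    Matrix Inversion Lemma with A = (1 - beta) R, U = lam, C = beta, V = lam^T.
    Assumes 0 < beta < 1 and R symmetric positive definite (so G is symmetric).
    """
    a = 1.0 - beta
    G_lam = G @ lam                       # the vector G * lam
    denom = a + beta * (lam @ G_lam)      # positive scalar when R > 0
    return (G - beta * np.outer(G_lam, G_lam) / denom) / a

rng = np.random.default_rng(0)
d = 4
R = np.eye(d)          # R_0 > 0
G = np.eye(d)          # G_0 = R_0^{-1}
for n in range(1, 51):
    lam = rng.standard_normal(d)          # illustrative stand-in for Lambda_{n+1}
    beta = 1.0 / (n + 1)
    R = R + beta * (np.outer(lam, lam) - R)   # direct recursion for R_n
    G = update_gain(G, lam, beta)             # recursion for G_n, no inversion

# G should remain the inverse of R up to rounding error.
max_err = np.max(np.abs(G @ R - np.eye(d)))
```

Each update costs $O(d^2)$ operations rather than the $O(d^3)$ of a fresh inversion; the sketch does not address the case in which $R_n$ is singular.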

Do not forget the warnings in Section 9.5.4: if $R_n$ is never invertible, then you will need to
either prune the basis, or use a pseudo-inverse to define $G_n$.


10.9 Technical Proofs* 


The proofs of Thms. 10.11 and 10.15 are similar to the proof of Thm. 9.7, except that we need to
understand the impact of regeneration. Recall that $\Phi$ is assumed defined on the two-sided time
axis. For $\lambda = 1$, the steady-state realization of the eligibility vectors is defined by


r= DP a *W(@h)), neZz (10.56) 
all <ke<n 


That is, we allow negative values of n and k (recall (10.19)). 
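The following solves half of the “regeneration puzzle”. For any function $g\colon \mathsf{X} \times \mathsf{U} \to \mathbb{R}$ and
$\gamma \in [0,1]$ denote $\hat g = T_\gamma g$, with

$$\hat g(z) = \mathsf{E}\Bigl[\sum_{k=0}^{\tau_\bullet - 1} \gamma^k g(\Phi(k)) \,\Big|\, \Phi(0) = z\Bigr]$$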
We have g = J,(z) ifg =c, and g = H3 ifg =¢ and y= 1.


Lemma 10.23. The adjoint of T:, satisfying (T29,h)o = (g,[T:]'h) for each g,h € L2(@), is 
: 7 7 7 
given by 


(T3]'h (2) = | S> yr Fh(@(n)) | O(n) = 2], 2 EXxU, neZ. 


where {®(k) :k € Z} is a stationary version of the Markov chain. In particular, with G, defined in 
(10.56), 


Ex[9((n))¥(P(n))] = Ealg(P(n))Gn 
Proof. It is enough to establish the identity for n = 0. From the definitions, 
(L3g,h)o = Ealh(®(0))9(®(0))] = Ex noone[S + g((R)) | (0) | 
The smoothing property of conditional expectation gives : 
Eo[h((0))9((0))] = Ea |h(¥(0)) > +g((k))| 


and then by stationarity and the definition of 7., 


Ec [h((0))9(®(0))] = Exl(¥(0))9(@(0))] + D> "Eo [1{B(i) A 2 = 1 <j < k}H(G(0))9(@(4))) 
k=1 


= Ealh(®(0))9((0))] + D> Eo |1{ Bj — h) A 21 <j < h}H(B(—K))9(¥(0))| 
k=1 
Make the change of variables € = j — k, so that 1{®@(j —k) A2z:1<j<k}=1{O(%) #2: 
—k+1< <0}. Adopting this change, and then returning the sum and the coefficient y* within 
the expectation gives 


Ea[h(®(0))G(®(0))] 


where the sum is defined to be zero when gO = 0. This establishes the desired result forn =0. O 


The second half of the puzzle is solved next: 


Lemma 10.24. For any 7 € [0,1] and function H: Xx U>R, 
(i) Eo |(—H(®(n)) + {O(n + 1) A a }A(@(n + 1))) Gr] = —Ew [H(G(0))(G(0))] 
(ii) Eo [(—H(®(n)) + yH(®(n + 1))) Gn] = Eo [{-H(8(0)) + y* A(z") }4(8(0))]
Proof. The proof of (i) is obtained as a corollary to the previous lemma, using 
g(z) = —A(z) + yE[1{®(n + 1) F 2} A(@(n + 1) | O(n) =z], 
so that by the smoothing property, 


Eo[(—H(®(n)) + 1{O(n +1) A Z}A(@(n + 1)))Cn] = Ew [g(®(n)) Gn] 


It remains to prove that Ea[g(®(n))¢n] = —Ea[H(®(n))Y(®(n))]. 
Lemma 10.23 gives


Ealg(®(n))cn] = Ealg(®(0))Co] = Ealo(®(0))G(2(0))] 
1 


The smoothing property of the conditional expectation provides a useful representation for each 
expectation: 


Ea [y((0))1{k < re} 9(O(K))] 
= Eo[d((0))1{k < r.}{—-H(P(k)) + YE[1{O(k + 1) A A} A(G(k + 1)) | Fi] f] 
= Eo[y(®(0))1{k < t}{-H(®(k)) + L{B(k + 1) A eS A(@(k + 1))}] 


So that on substitution, 


Eola =- Yo Fol¥ 0))1{k < re} H(®(k))] 
+ Drte ol 0))a{k < re} {yl{O(k +1) 4 a} H(O(k + 1))}] 
eo /v(®(0) > +" H(®(k)) 


DD 3E K+lafO(k +1) ~ 2*}H(®(k +1)) 


The difference of sums reduces to y(®(0))H(®(0)) (all other terms cancel), giving (i). 
The proof of (ii) is the same, except that all but two terms cancel. O 


Proof of Thm. 10.11. Applying (10.21) we have 


A= Eo[An+41] =E® [Sn {—W(n) ot yI{P(n os 1) # 2} da+)}"] 
Next apply Lemma 10.24 (i) with $H = \psi_i$, for arbitrary $i$. Letting $A^i$ denote the $i$th column
of $A$, the lemma implies


AY = Ea [{—di(®(n)) + {O(n + 1) F 2" }i(G(n + 1) fon] = Eo [Wi(®(0))4(B(0))] 


This establishes part (i): $A = -R(0)$.
For (ii), consider the first order condition for optimality for a parameter $\theta^\circ \in \mathbb{R}^d$:


0 = Vo3Eo[(H°(®(n)) — J,(®(n)))"]|g_ge = Eo [(H® (®(n)) — 4(H(n))) deny] 
By definition, Eg[H® (®(n)) vm) = R(0)0° = —Aé°. Applying Lemma 10.23, 
—E@ [F(@(n)) vin) | = —EalcenGn] = b 


where b = E[bn+1] (see (10.21)). Hence the first order condition for optimality becomes — A” +b = 0 
as claimed. o 


Proof of Thm. 10.15. We first establish (i): 7 = 6(u,H®’) for any solution to f(0*) = 0. For this 
we take g(z) = c(z) — 6(u, H®’) and apply Lemma 10.24 (ii):


0 = f(6") = Eo[{g9(®(n)) — H°(B(n)) + H°(®(n +1) Gn] 
= Ea[{9(®(0)) — H® (&(0)) + A® (z*)}4((0))] 


0=6" (10.57) 


where 


We next use the special property of 0°: 
OS 3 6;fi(0") = 3 9:Ex [Wi((0)){G((0)) — H” (®(0)) + H” (z")}] 
= Ea [1{(0) = 2*}{9(®(0)) — H” (®(0)) + A (z)}] 
Substituting the definitions, 
0 = Ea [1{(0) = 2°} {9((0)) — H” (®(0)) + A” (z")}] 
= @{2"}9() 


Under the assumptions of the theorem it follows that g(z*) = 0 and hence 7 = 6(u,H®) by 
Prop. 6.10 (Kac’s Theorem), which is (i). 

We also conclude that g = H3, so that (10.57) implies (ii). 

The remainder of the proof follows the proof of Thm. 10.11. First, revisiting (10.57) with the 
knowledge that g = ¢, 


0 =E[{H® (®(0)) — r* — H3(©(0))}¥(%(0))] (10.58) 


with r* = H®’(z*). We next show that this equation characterizes optimality of (10.34). 
The first order condition for optimality of $(\theta^\circ, r^\circ) \in \mathbb{R}^{d+1}$ for the minimization (10.34) is


The first equation is obtained on taking the gradient with respect to 0, and the second by taking 
the derivative with respect to r. The second equation follows from the first under the assumption 
that 1 is in the span of $\{\psi_i\}$. Hence (10.58) implies that $(\theta^*, r^*)$ satisfies the first order optimality
condition. □


10.10 Notes 


Adjoints and TD learning Most of the theory in this chapter was developed by the MIT group
led by John Tsitsiklis in a single decade starting in the early 1990s.

The wonderful geometry surrounding TD learning is part of the dissertation of Ben Van Roy [363], 
which summarizes several remarkable papers, including [355, 356, 354, 353]. Prop. 10.1 appears as
[356, Thm. 1], along with bounds for any value of $\lambda$: with $\theta^*(\lambda)$ the solution to TD($\lambda$),


|?" — Ql < F—TN@-@lla,  Q= HO 


The use of regeneration to define the eligibility sequence (10.19a), inspired by the older repre- 
sentation of Poisson’s equation H3 in (10.17), was introduced in [192] upon realizing that TD(1) 
learning could be used to obtain unbiased gradient estimates. Nummelin’s monograph [277] has
had great impact in both statistics and Markov chain theory. One significant contribution of his 
research, and highlighted in his book, is how regeneration times can be constructed for a Markov 
chain on a continuous state space. This may be valuable in future research. 


Actor-critic methods Glynn’s research in the 1980s [142, 143] introduced likelihood ratio meth- 
ods for stochastic gradient descent without bias, based in part on [313] (i.e., the sensitivity theory 
surveyed in Section 6.8). Just over one decade later this was extended and applied to obtain the 
first unbiased stochastic gradient descent approach for reinforcement learning [239, 240] (see also 
[32, 31]), followed soon after with new insights in [344]. This work was the start of the actor-critic 
revolution that followed. 

Two time-scale stochastic approximation and associated variance theory was still evolving in the 
late 90s. Konda’s research with Borkar [189, 190] and then Tsitsiklis [188, 193] helped to shed light 
on this topic, and [188, 192] introduced a major advancement in actor-critic theory and application: 
the compatible features property (10.47) was introduced in this work, based on the prior $L_2$ theory
surveyed above [355, 356, 354, 353].

The introduction of the advantage function as a means to accelerate actor-critic algorithms was 
proposed in [172], following [23]. Prop. 9.2 of [172] was later applied to obtain the trust region 
policy optimization (TRPO) algorithm of [311], which spawned many other approaches. 

The regenerative structure leading to the i.i.d. samples $\{S_n\}$ appearing in (10.24) suggests that
the policy should be frozen between regeneration times in algorithm implementation, similar to the
way qSGD was applied for the Mountain Car example in Section 4.7.1. This was one approach in
[239, 240], and similar “episodic” approaches are used in [311, 312].

The better advantage control variate technique is new, along with the covariance comparison
Prop. 10.22.

The natural actor-critic algorithm was introduced in [173] (see also [53, 285, 151, 56]). The 
value of the natural gradient in terms of acceleration is explained in [2]; also, in this work and 
the concurrent research [48, 248] it is shown that a large class of policy gradient algorithms are 
globally convergent with appropriate regularization (recall R,, below (10.46)). The recent article 
[368] provides an elegant Lyapunov analysis for bandit problems, and [237] contains a survey of 
actor-critic methods. 

Absent from this chapter is any discussion on how to choose the regularizer in (10.46) for 
application to gradient descent. See [270] for an approach intended to improve both policy and 
value function approximation. 

Left out of this chapter is any mention of the application of the gradient free techniques em- 
phasized in Sections 4.6 and 4.7 based on SPSA: the algorithms presented there were actually first 
proposed in a stochastic setting—a history can be found in Section 4.11. Williams’ REINFORCE 
algorithm [376] may be the first to apply these techniques to MDPs to create actor-only algorithms 
of the form described in Section 4.6. 

More recently, in [237] it is argued that SPSA is sometimes more efficient than any of the
actor-critic algorithms introduced to date. Algorithm 1 of [237] is a version of SPSA, and essentially
the same as the original Kiefer-Wolfowitz algorithm [182]. However, the efficiency results of [237]
deserve a warning label: any technique involving multiple function evaluations for a single gradient
estimate must take into account observation noise. This concern is the reason for emphasis on
qSGD methods #1 and #2 in Section 4.6 of this book.


Some ancient history Many in the RL community have lost track of the elegant control theory
for Markov chains pioneered by Mandl [236] and Borkar [61, 62, 63]. This work is easy to miss
because the language and notation are so different from what is used today in the RL literature.
The authors begin with a family of transition matrices $\{T_\theta : \theta \in \mathbb{R}^d\}$ on a state space $\mathsf{Z}$ (in later
papers the theory is extended to general state spaces and continuous time). One goal is to find the
parameter $\theta^*$ that minimizes the average cost.

A reader might dismiss this as being far from RL, since the family of models is assumed known. 
A closer look reveals that the observations are log-likelihoods, $L_\theta = \log(T_\theta/T_{\theta^\circ})$, where $\theta^\circ$ is fixed
but arbitrary (it plays no essential role in the theory). The setting is thus far more general than
this chapter. For the very special case (10.37),

$$\log\frac{T_\theta(z,z')}{T_{\theta^\circ}(z,z')} = \log\frac{\check\phi^{\theta}(u' \mid x')}{\check\phi^{\theta^\circ}(u' \mid x')}$$


This “ancient history” deserves closer inspection, as well as the substantial concurrent work in the 
USSR [136]. 


Fisher information The interpretation of (10.54) as Fisher information is not easily justified. 

The term arises in the theory of parameter estimation [361, 6]. In the context of this chapter,
the estimation problem is the same as in Borkar’s thesis [62, 63]: the input is chosen according to
$\check\phi^{\theta^\circ}$ for some $\theta^\circ \in \mathbb{R}^d$. Given samples $\{\Phi(k) : 0 \le k \le n\}$, the maximum likelihood estimate $\hat\theta$ of
the true parameter $\theta^\circ$ is a solution to

$$0 = \nabla_\theta \log\Bigl(\prod_{k=0}^{n-1} T_\theta(\Phi(k), \Phi(k+1))\Bigr)\Big|_{\theta = \hat\theta} = \sum_{k=1}^{n} \Lambda^{\hat\theta}(\Phi(k))$$

In this context, the Fisher information is the normalized covariance of this statistic, evaluated at
$\theta = \theta^\circ$:

$$F_n(\theta^\circ) = \frac{1}{n}\,\mathrm{Cov}\Bigl(\sum_{k=1}^{n} \Lambda^{\theta^\circ}(\Phi(k))\Bigr)$$

Due to the martingale difference property established in Lemma 10.20,

$$\lim_{n\to\infty} F_n(\theta^\circ) = \lim_{n\to\infty} \frac{1}{n}\sum_{k=1}^{n} \mathrm{Cov}\bigl(\Lambda^{\theta^\circ}(\Phi(k))\bigr) = F(\theta^\circ)$$

with the right hand side defined in (10.54).

The matrix $F_n(\theta^\circ)$ is a measure of the sensitivity of the first $n$ observations to the parameter
$\theta^\circ$, and from this limit we can justify the interpretation of $F(\theta^\circ)$ as the sensitivity with respect
to the entire history of observations. This is an elegant conclusion for applications to parameter
estimation, but doesn’t explain why the inverse of $F(\theta_n)$ is a good gain for applications to actor-critic
algorithms.


Pre-publication draft -- March 25, 2022 


Appendix A 


Mathematical Background 


A.1 Notation and Math Background 


This section reviews basic notation and concepts from calculus and real analysis. 


Generalities Concerning functions and sequences. 

> $\mathbb{1}_A$: Indicator function of a set $A$. This means that $\mathbb{1}_A(x) = 1$ when $x \in A$ and 0 otherwise.

> $J(\,\cdot\,)$: the “dot” is used to stress that $J$ is a function of some variable.

> $\boldsymbol{u}$: bold face is compact notation used to designate a sequence. Alternative notation: $\boldsymbol{u} =
\{u_0, u_1, \dots\} = u_{[0,\infty)}$.

> For a function $J\colon \mathbb{R}^n \to \mathbb{R}$, do not forget that the gradient $\nabla J$ is not the same as the derivative
$\partial J$. The gradient is a column vector, and the derivative a row vector (the linear approximation of
the scalar valued function $J$).


Topology Concerning sets and sequences in R”. 
> Neighborhood of $x \in \mathbb{R}^n$: a set containing $x$ that is open.
> A set $S \subset \mathbb{R}^n$ is called compact if it is closed and bounded.


> A collection of sets $\{O_\nu : \nu \in \mathcal{N}\}$ (where the index set $\mathcal{N}$ may be uncountably infinite) is called
a covering of a set $S \subset \mathbb{R}^n$ if $O_\nu \subset \mathbb{R}^n$ for each $\nu \in \mathcal{N}$, and $S \subset \bigcup_\nu O_\nu$. Many proofs in the book
make implicit use of the following characterization: a set $S$ is compact if and only if every covering
by open sets admits a finite sub-covering [there is a finite collection of indices $\{\nu_i : 1 \le i \le N\}$
satisfying $S \subset \bigcup_{i=1}^{N} O_{\nu_i}$].


> For a function $g\colon \mathsf{Z} \to \mathbb{R}$, the span seminorm is defined by
$$\|g\|_{\mathrm{sp}} = \min_r \max_z |g(z) - r| = \tfrac{1}{2}\bigl[\max_z g(z) - \min_z g(z)\bigr]$$


Equation (10.34) shows an example of a variation of this norm. 
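
The two expressions for the span seminorm can be compared numerically on a finite set. The following is a minimal sketch, assuming NumPy; the function name `span_seminorm`, the test vector, and the grid used for the brute-force minimization over $r$ are hypothetical choices made for illustration.

```python
import numpy as np

def span_seminorm(g):
    """Closed form: half of (max g - min g)."""
    g = np.asarray(g, dtype=float)
    return 0.5 * (g.max() - g.min())

g = np.array([3.0, -1.0, 4.0, 1.0, 5.0])

# Brute-force version of min_r max_z |g(z) - r| over a fine grid of r values.
r_grid = np.linspace(-2.0, 6.0, 8001)
brute = min(np.max(np.abs(g - r)) for r in r_grid)

closed = span_seminorm(g)   # (5 - (-1)) / 2
```

The minimizing $r$ is the midpoint of the range of $g$, which is why adding a constant to $g$ leaves the seminorm unchanged.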


The following notation refers to a collection of vectors or scalars indexed by a variable $t \in
T$ interpreted either as time, or the number of iterations in an algorithm. Examples: $\mathbb{Z}_+ =
\{0,1,2,\dots\}$ and $\mathbb{R}_+$.


> $\sup_t a_t$: supremum of the scalars $\{a_t : t \in T\}$, also known as the least upper bound (LUB).


Denote the set of all upper bounds by $S = \{r \in \mathbb{R} : a_t \le r \text{ for each } t \in T\}$. If $\{a_t\}$ is bounded
from above, then this set is non-empty and can be expressed $[s_0, \infty)$ for some $s_0 \in \mathbb{R}$. This is by
definition the supremum: $\sup_t a_t = \min\{s : s \in S\} = s_0$.

> The infimum is the greatest lower bound, or $\inf_t\{a_t\} = -\sup_t\{-a_t\}$.


> $\limsup_{t\to\infty} b_t$: limit supremum, defined as follows for a scalar-valued function of $t$. First denote


$$\bar b_r = \sup\{b_t : t \ge r\}, \qquad r \in T$$


As $r$ increases, $\bar b_r$ cannot increase (the supremum is taken over a smaller set). The limit as $r \to \infty$
is the limit supremum:
$$\limsup_{t\to\infty} b_t = \lim_{r\to\infty} \bar b_r$$


> $\liminf_{t\to\infty} b_t = -\limsup_{t\to\infty}(-b_t)$: limit infimum


> We say the limit exists if $\liminf_{t\to\infty} b_t = \limsup_{t\to\infty} b_t$.


> For two scalar-valued functions of time $\{a_t, b_t : t \in T\}$:
△ $a_t = O(b_t)$: the ratio is bounded, so that $|a_t| \le B|b_t|$ for some constant $B$ and all $t \in T$.


△ $a_t = o(b_t)$: $\lim_{t\to\infty} a_t/b_t = 0$.


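> We also consider the parameter tending to zero rather than infinity. For example, this bound
might be anticipated in Section 4.6:


$$\Gamma(\theta + \varepsilon\xi) = \Gamma(\theta) + \varepsilon\,\xi^\top \nabla\Gamma(\theta) + o(\varepsilon), \qquad \varepsilon \to 0$$


This is shorthand for the following limit:


$$\lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon}\Bigl|\Gamma(\theta + \varepsilon\xi) - \bigl\{\Gamma(\theta) + \varepsilon\,\xi^\top \nabla\Gamma(\theta)\bigr\}\Bigr| = 0$$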

Linear algebra Concerning n-dimensional vectors and n x n matrices. 

> $v \cdot w$: for two vectors of the same dimension, this is the inner product. Written as column
vectors: $v \cdot w = v^\top w$


> Singular values of a matrix $R$: $\{\sigma_1, \dots, \sigma_n\}$. First obtain the $n$ eigenvalues $\{\lambda_i\}$ of $RR^\top$, and
then define $\sigma_i = \sqrt{\lambda_i}$.


> Condition number of $R$: the ratio of the maximum and minimum singular values.
> Positive definite: $R > 0$ means that $x^\top R x > 0$ whenever $x \in \mathbb{R}^n$ with $x \neq 0$


> Positive semidefinite: $R \ge 0$ means that $x^\top R x \ge 0$ whenever $x \in \mathbb{R}^n$
Note: In this book the statement that $R$ is positive definite, or positive semidefinite, carries
with it the hidden assumption that the matrix is symmetric: $R = R^\top$.


> Matrix Inversion Lemma: for matrices $A$, $U$, $C$ and $V$ of compatible dimension,


$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\bigl(C^{-1} + VA^{-1}U\bigr)^{-1}VA^{-1} \tag{A.1}$$
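
The linear algebra definitions above are easy to check numerically. The following is a minimal sketch, assuming NumPy; the matrix sizes and random test matrices are hypothetical. It computes singular values from the eigenvalues of $RR^\top$, forms the condition number, and compares both sides of the Matrix Inversion Lemma (A.1).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 2
R = rng.standard_normal((n, n))

# Singular values from the eigenvalues of R R^T, compared with the SVD.
eigs = np.linalg.eigvalsh(R @ R.T)                     # ascending order
sigma_from_eigs = np.sqrt(np.clip(eigs, 0.0, None))[::-1]
sigma_svd = np.linalg.svd(R, compute_uv=False)         # descending order

# Condition number: ratio of the largest and smallest singular values.
cond = sigma_svd[0] / sigma_svd[-1]

# Matrix Inversion Lemma (A.1), with A and C chosen to be invertible.
A = np.diag(rng.uniform(1.0, 2.0, n))
U = rng.standard_normal((n, m))
C = np.eye(m)
V = U.T
A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ C @ V)
inner = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)
rhs = A_inv - A_inv @ U @ inner @ V @ A_inv
```

The lemma is most useful when $m \ll n$: the only new inverse on the right hand side is of an $m \times m$ matrix.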


A.2 Probability and Markovian Background


A prerequisite of Part 2 of the book is some understanding of stochastic processes, which means
you need to know the meaning of $(\Omega, \mathcal{F}, \mathsf{P})$ and related machinery.


A.2.1 Events and Sample Space 


Here we make precise the meaning of a probability space, and introduce the shift-operator that 
formally defines the Markov property. 


Events What is $(\Omega, \mathcal{F}, \mathsf{P})$? First, recall that the σ-field $\mathcal{F}$ denotes the set of events. Each event
$A \in \mathcal{F}$ must be a subset of $\Omega$, and $\mathcal{F}$ defines the domain of the probability measure $\mathsf{P}$. By “domain”
we mean that $\mathsf{P}(A)$ is defined only if $A \in \mathcal{F}$, and not for any other subset $A \subset \Omega$.

To be a σ-field, $\mathcal{F}$ must contain $\Omega$ and be closed under complements and countable unions (and hence
also under countable intersections). Please review
this material, along with the definition of sub-σ-fields and conditional expectation. The first chapter
of Hajek’s textbook [154] is a great reference.

A (real-valued) random variable $H$ is a mapping $H\colon \Omega \to \mathbb{R}$ that is measurable with respect to
$\mathcal{F}$. That is, the set $E_c = \{\omega : H(\omega) \le c\}$ is an event (i.e., $E_c \in \mathcal{F}$) for each $c \in \mathbb{R}$. When we write
$\mathsf{P}\{H \le c\}$, this is shorthand for the probability of the event: $\mathsf{P}\{E_c\}$.

A stochastic process is a family of random variables indexed by time. If we take discrete time,
and restrict to times $k \ge 0$, then a stochastic process is a sequence of random variables denoted
$\boldsymbol{X} = \{X_k : k \in \mathbb{Z}_+\}$. Subscripts are adopted in this appendix to save space, and because we need
to stress that $X_k$ is a function on $\Omega$ for each $k$.

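Suppose that each random variable takes values in a discrete set $\mathsf{X}$. For an integer $N \ge 1$,
denote by $\mathcal{F}_N \subset \mathcal{F}$ the smallest σ-field that contains events of the form


$$E_{(x_0,\dots,x_N)} \overset{\text{def}}{=} \{\omega \in \Omega : X_k(\omega) = x_k,\ 0 \le k \le N\} \tag{A.2}$$


for any collection $\{x_i\} \subset \mathsf{X}$. If $H$ is a random variable on $(\Omega, \mathcal{F}_N, \mathsf{P})$, then there is a function
$h\colon \mathsf{X}^{N+1} \to \mathbb{R}$ such that


$$H(\omega) = h(X_0(\omega), X_1(\omega), \dots, X_N(\omega)), \qquad \omega \in \Omega$$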


Sample Space The set $\Omega$ is called the sample space, whose definition is a modeling choice. When
studying a single stochastic process, it is convenient to choose the set of all possible state sequences.
That is, each $\omega \in \Omega$ is a sequence of states:


$$\omega = (\omega_0, \omega_1, \omega_2, \dots) \quad \text{with } \omega_i \in \mathsf{X} \text{ for each } i.$$
It is interpreted as a possible realization of the stochastic process, so that $X_k(\omega) = \omega_k$. When $\Omega$ is
defined in this way, the event defined in (A.2) becomes


$$E_{(x_0,\dots,x_N)} = \{\omega \in \Omega : \omega_i = x_i,\ 0 \le i \le N\} \tag{A.3}$$

And in this case, we typically take $\mathcal{F}$ to be the smallest σ-field that contains all events of the form
(A.3), where $N$ ranges over all integers $N \ge 0$, and the $\{x_i\}$ can take on any value in $\mathsf{X}$. This
σ-field contains any event of interest, including


$$\Bigl\{\omega \in \Omega : \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} c(\omega_i) = \eta\Bigr\}$$
where $c\colon \mathsf{X} \to \mathbb{R}$ is any function, and $\eta \in \mathbb{R}$ any constant.


A.2.2. Markov chain basics 


We will keep things simple and assume that the state space X for the Markov chain is finite or 
countably infinite. The transition matrix is denoted P. 

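For each pmf $\mu$ on $\mathsf{X}$, there is a probability measure $\mathsf{P}_\mu$ on $\mathcal{F}$ defined so that $\mu$ is the initial
distribution for the chain. Consistent with definitions in Chapter 6,


$$\mathsf{P}_\mu\{X_0 = x\} = \mu(x)$$


$$\mathsf{P}_\mu\{X_k = x\} = \sum_{x' \in \mathsf{X}} \mu(x') P^k(x', x), \qquad x \in \mathsf{X}$$


$$\mathsf{E}_\mu[g(X_k)] = \sum_{x \in \mathsf{X}} \mathsf{P}_\mu\{X_k = x\}\, g(x) \qquad \text{for any } g\colon \mathsf{X} \to \mathbb{R}$$
When $\mu$ is degenerate, in the sense that $\mu(x) = 1$ for some $x$, then we write $\mathsf{P}_x$ and $\mathsf{E}_x$.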
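
For a finite state space these formulas are matrix-vector products. The following is a minimal sketch, assuming NumPy; the three-state transition matrix `P`, the function `g`, and the helper names `marginal` and `expectation` are hypothetical examples, not taken from the book.

```python
import numpy as np

# Hypothetical transition matrix on X = {0, 1, 2}; each row sums to one.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
mu = np.array([1.0, 0.0, 0.0])   # degenerate initial pmf: X_0 = 0
g = np.array([0.0, 1.0, 2.0])    # a function g on X

def marginal(mu, P, k):
    """The pmf of X_k when X_0 ~ mu: the row vector mu P^k."""
    return mu @ np.linalg.matrix_power(P, k)

def expectation(mu, P, k, g):
    """E_mu[g(X_k)] = sum_x P_mu{X_k = x} g(x)."""
    return marginal(mu, P, k) @ g

p2 = marginal(mu, P, 2)          # pmf of X_2 started from state 0
e2 = expectation(mu, P, 2, g)    # E_0[g(X_2)]
```

Since `mu` is degenerate at state 0, `p2` is the first row of $P^2$ and `e2` is $\mathsf{E}_x[g(X_2)]$ with $x = 0$.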


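The shift operators are mappings on $\Omega$ that provide compact language for complex concepts.
For each $k$, the shift operator $\theta^k$ maps an element $\omega = (x_0, x_1, \dots, x_n, \dots) \in \Omega$ to a new value via


$$\theta^k \omega = (x_k, x_{k+1}, \dots, x_{n+k}, \dots)$$
It defines a transformation on random variables $H$ by
$$(\theta^k H)(\omega) = H(\theta^k \omega).$$
Hence if the random variable $H$ is of the form $H = h(X_0, X_1, \dots)$ for a function $h$, then
$$\theta^k H = h(X_k, X_{k+1}, \dots)$$
Specializing to $H = h(X_n)$ for some $n$ and some $h\colon \mathsf{X} \to \mathbb{R}$ gives
$$\theta^k H = h(X_{n+k})$$
and hence


$$\begin{aligned}
\mathsf{E}_x[\theta^k H \mid \mathcal{F}_k] &= \mathsf{E}_x[h(X_{n+k}) \mid \mathcal{F}_k] \\
&= \mathsf{E}_x[h(X_{n+k}) \mid X_k] && \text{by the Markov property} \\
&= P^n h\,(X_k) && \text{by definition of the transition matrix}
\end{aligned}$$


This can be generalized: for any initial distribution $X_0 \sim \mu$, any bounded random variable $H$, and
fixed $k, n \in \mathbb{Z}_+$:


$$\mathsf{E}_\mu[\theta^k H \mid \mathcal{F}_k] = \mathsf{E}_{X_k}[H] \qquad \text{a.s. } [\mathsf{P}_\mu] \tag{A.4}$$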

This describes the (time homogeneous) Markov property in a succinct way.
Note that we are viewing $\mathsf{E}_x[H]$ as a real-valued function on the state-space. Henceforth we
will substitute:


$$\mathsf{E}_{X_k}[H] = \mathsf{E}_x[H]\big|_{x = X_k}$$


A.2.3 Strong Markov property 


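The Strong Markov Property is described by a significant extension of the formula (A.4). The
definition of this property requires these three ingredients:


(i) A function $\tau\colon \Omega \to \mathbb{Z}_+ \cup \{\infty\}$ is a stopping time for $\boldsymbol{X}$ if the event $\{\tau = n\}$ lies in $\mathcal{F}_n$ for
each $n \in \mathbb{Z}_+$. That is, for each $n$ there is a function $f_n$ such that


$$\mathbb{1}\{\tau = n\} = f_n(X_0, \dots, X_n)$$
(ii) The associated shift operator $\theta^\tau$ is defined exactly as above:
$$\theta^\tau H = h(X_\tau, X_{\tau+1}, \dots) \qquad \text{when } H = h(X_0, X_1, \dots)$$
(iii) An associated σ-field:
$$\mathcal{F}_\tau = \{A \in \mathcal{F} : \{\tau = n\} \cap A \in \mathcal{F}_n \text{ for each } n \in \mathbb{Z}_+\}. \tag{A.5}$$


This is interpreted as the events which happen “up to time $\tau$”.
Two important examples of stopping times are, for any set $A \subset \mathsf{X}$,
$$\tau_A \overset{\text{def}}{=} \min\{n \ge 1 : X_n \in A\}$$


$$\sigma_A \overset{\text{def}}{=} \min\{n \ge 0 : X_n \in A\}$$


known as the first return and first hitting times on $A$, respectively.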


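Proposition A.1. For any set $A \subset \mathsf{X}$, the variables $\tau_A$ and $\sigma_A$ are stopping times for $\boldsymbol{X}$.


Proof. The random variables $\tau_A$ and $\sigma_A$ have this representation, for any $n \ge 1$,


$$\mathbb{1}\{\tau_A = n\} = f_n(X_0, \dots, X_n) = \mathbb{1}\{X_n \in A\} \prod_{i=1}^{n-1} \mathbb{1}\{X_i \notin A\}$$


$$\mathbb{1}\{\sigma_A = n\} = g_n(X_0, \dots, X_n) = f_n(X_0, \dots, X_n)\,\mathbb{1}\{X_0 \notin A\}$$


where $\prod_{i=1}^{0} \equiv 1$ (handles the case $n = 1$).
For $n = 0$ we have $\mathbb{1}\{\tau_A = 0\} = f_0(X_0) = 0$, and $\mathbb{1}\{\sigma_A = 0\} = g_0(X_0) = \mathbb{1}\{X_0 \in A\}$. □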
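
The two stopping times differ only in whether time zero is examined. The following is a minimal sketch in plain Python, evaluated on a finite sample path; the function names and the example paths are hypothetical. Each function decides whether the time equals $n$ by looking only at $X_0, \dots, X_n$, which is the stopping time property.

```python
def first_return_time(path, A):
    """tau_A = min{n >= 1 : X_n in A}; None if A is not reached on this path."""
    for n in range(1, len(path)):
        if path[n] in A:
            return n
    return None

def first_hitting_time(path, A):
    """sigma_A = min{n >= 0 : X_n in A}; None if A is not reached on this path."""
    for n, x in enumerate(path):
        if x in A:
            return n
    return None

A = {2}
path_in_A = [2, 0, 1, 2, 1]      # starts in A
path_outside = [0, 1, 2, 1]      # starts outside A

tau_in = first_return_time(path_in_A, A)        # ignores time 0
sigma_in = first_hitting_time(path_in_A, A)     # time 0 counts
tau_out = first_return_time(path_outside, A)
sigma_out = first_hitting_time(path_outside, A)
```

When $X_0 \notin A$ the two times coincide, in agreement with the representation $g_n = f_n \mathbb{1}\{X_0 \notin A\}$ in the proof.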


A finite-valued random variable $H$ that is $\mathcal{F}_\tau$-measurable can be expressed as an infinite sum:


$$H = \sum_{n=0}^{\infty} h_n(X_0, \dots, X_n)\,\mathbb{1}\{\tau = n\}$$


for a sequence of functions $\{h_n\}$. This is true so that we have the required property:
$H\mathbb{1}\{\tau = k\} = h_k(X_0, \dots, X_k)\mathbb{1}\{\tau = k\}$ is $\mathcal{F}_k$-measurable for each $k$.
An example is the random variable $X_\tau$, defined by setting $X_\tau = X_n$ on the event $\{\tau = n\}$:
$$X_\tau = \sum_{n=0}^{\infty} X_n \mathbb{1}\{\tau = n\}$$
Finally, we come to a key definition: 
Strong Markov Property. $\boldsymbol{X}$ has the Strong Markov Property if for any initial distribution
$\mu$, any real-valued bounded random variable $H$, and any stopping time $\tau$,


$$\mathsf{E}_\mu[\theta^\tau H \mid \mathcal{F}_\tau] = \mathsf{E}_{X_\tau}[H] \qquad \text{a.s. } [\mathsf{P}_\mu], \text{ on the event } \{\tau < \infty\}. \tag{A.6}$$


Proposition A.2. For a Markov chain $\boldsymbol{X}$ with discrete time parameter, the Strong Markov
Property always holds.


Proof. This is a consequence of decomposing the expectations on both sides of (A.6) over the sets
$\{\tau = n\}$, and using the ordinary Markov property, in the form of equation (A.4), at each of
these fixed times $n$. □


Appendix B


Markov Decision Processes 


This section mirrors the optimal control theory surveyed in Chapter 3. See Section 7.1 for the 
definition of an MDP and surrounding notation. Throughout the remainder of the appendix it is 
assumed that the state space X and action space U are finite. 


B.1 Total Cost and Every Other Criterion 


The definition of the total cost value function $J^*$ defined in (3.2) is unchanged, except that we
introduce an expectation, and change the notation:


$$h^*(x) = \min \sum_{k=0}^{\infty} \mathsf{E}_x[c(\Phi_k)] \tag{B.1}$$


where $\Phi_k = (X_k, U_k)$, the minimum is over all admissible policies, and the subscript indicates that
$X_0 = x$.
When finite, this value function solves the Bellman equation


$$h^*(x) = \min_u \Bigl\{c(x,u) + \sum_{x'} P_u(x, x')\, h^*(x')\Bigr\}, \qquad x \in \mathsf{X}$$
This is very similar to the dynamic programming equation (3.5) for $J^*$, especially when expressed
in the equivalent sample path form:
$$h^*(X_k) = c(\Phi_k) + \mathsf{E}[h^*(X_{k+1}) \mid \mathcal{F}_k] \qquad \text{when } U_k = \phi^*(X_k)$$


Also similar to the deterministic setting, the optimal policy is any minimizer:
$$\phi^*(x) \in \arg\min_u \Bigl\{c(x,u) + \sum_{x'} P_u(x, x')\, h^*(x')\Bigr\}, \qquad x \in \mathsf{X}$$


There is however a significant difference in the stochastic control formulation: in most cases
we find that $\mathsf{E}_x[c(\Phi_k)]$ converges to a strictly positive constant, as $k \to \infty$, meaning that $h^*$ is not
finite. Why then should we care about (B.1)?




B.1.1 Total cost in many flavors


What follows are several examples of optimization criteria for MDPs, and how they can be trans- 
formed to the total cost criterion. 


Shortest Path Problem (SPP). Given a target set $S \subset \mathsf{X}$ and a terminal cost $V_0\colon S \to \mathbb{R}$,
define for each $x \in S^c$:


$$h^*(x) = \min \mathsf{E}_x\Bigl[\sum_{k=0}^{\tau_S - 1} c(\Phi_k) + V_0(X_{\tau_S})\Bigr]$$


This is transformed to the total cost problem through the enlargement of $\mathsf{X}$ to include a graveyard
state denoted $\blacktriangle$. The “graveyard property” indicates that $P_u(\blacktriangle, \blacktriangle) = 1$ for each $u$, and we also
impose $P_u(x, \blacktriangle) = 1$, for $x \in S$ and any $u$. The cost function is also modified: $c(x,u) = V_0(x)$ for
$x \in S$, and $c(\blacktriangle, u) = 0$ for each $u$. Under these conventions, the value function for the SPP can be
represented as (B.1).


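Discounted Cost Value Function. For a discount factor $\gamma \in [0, 1)$,


$$h^*(x) = \min \sum_{k=0}^{\infty} \gamma^k\, \mathsf{E}_x[c(\Phi_k)] \tag{B.2}$$


Enlarge the state space for the MDP with a geometric random variable $T$ with parameter $\gamma$,
and independent of everything:
$$\mathsf{P}\{T > k \mid \Phi_0^\infty\} = \gamma^k$$


Then, $\Psi_k = (X_k, B_k)$ is a Markovian state, with $B_k = \mathbb{1}\{T > k\}$. Independence of $\boldsymbol{X}$ and $\boldsymbol{B}$
implies that

$$\mathsf{E}_x[c(\Phi_k) B_k] = \gamma^k\, \mathsf{E}_x[c(\Phi_k)]$$

Consequently, with initial condition $\Psi_0 = (x, 1)$,

$$h^*(x) = \min \sum_{k=0}^{\infty} \mathsf{E}[\tilde c(\Psi_k, U_k)]$$

with $\tilde c(z,u) = c(x,u)\,b$ when $z = (x, b)$.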


Most surprising is the transformation of the average cost criterion: 


Average Cost Optimal Control. Denote for any input sequence,


$$\eta_U(x) = \limsup_{n\to\infty} \frac{1}{n}\sum_{k=0}^{n-1} \mathsf{E}_x[c(\Phi_k)] \tag{B.3}$$


The minimum over all admissible inputs is denoted $\eta^*(x)$.

The minimum is typically independent of $x$, and the solution is obtained via a SPP with modified
cost. Consider

$$h^*(x) = \min \mathsf{E}_x\Bigl[\sum_{k=0}^{\tau_\bullet - 1} \{c(\Phi_k) - \eta^*\}\Bigr] \tag{B.4}$$
with $S = \{x^\bullet\}$ a singleton, so this is the first return time used in Thm. 6.3:
$$\tau_\bullet = \min\{k \ge 1 : X_k = x^\bullet\} \tag{B.5}$$


If $\pi^*(x^\bullet) > 0$, with $\pi^*$ the invariant pmf under the optimal policy, then $h^*$ solves the average cost
optimality equation (ACOE):


$$\min_u \{c(x,u) + P_u h^*(x)\} = h^*(x) + \eta^* \tag{B.6}$$


The function $h^*$ is known as the relative value function, and the minimizer is a stationary policy
that achieves the optimal average cost:


$$\phi^*(x) \in \arg\min_u \{c(x,u) + P_u h^*(x)\}$$


B.2 Computational Aspects of MDPs 


The value iteration and policy iteration (or improvement) algorithms each have extensions to the 
MDP setting. These techniques are reviewed here, along with a linear programming approach that 
is related to the LPs introduced in Section 3.5. 

This section is devoted to the ACOE (B.6). The relative value function $h^*$ is not unique: we
can always add a constant to obtain a new solution. It is convenient here to impose the additional
constraint $h^*(x^\bullet) = \eta^*$, where $x^\bullet$ is some distinguished state. Under mild conditions the solution is
then unique, and we eliminate $\eta^*$ from the ACOE:


$$\min_u \{c(x,u) + P_u h^*(x)\} = h^*(x) + h^*(x^\bullet) \tag{B.7}$$


Algorithm design begins with the representation of (B.7) as a fixed point equation: let $T$ denote
the functional that takes any function $h\colon \mathsf{X} \to \mathbb{R}$, and creates a new function via


$$T(h)\big|_x = \min_u \{c(x,u) + P_u h(x)\} - h(x^\bullet), \qquad x \in \mathsf{X}.$$


The ACOE can be expressed as the fixed point equation
$$h^* = T(h^*) \tag{B.8}$$


There are two common approaches to solve a fixed point equation: The first one is successive
approximation, which leads to value iteration (VIA). The second approach is the Newton-Raphson
method, which leads to the policy improvement (PIA) method.


B.2.1 Value Iteration and Successive Approximation


In successive approximation we initialize with a function $h_0\colon \mathsf{X} \to \mathbb{R}$, and then for $n \ge 0$


$$h_{n+1}(x) = T(h_n)\big|_x = \min_u \{c(x,u) + P_u h_n(x)\} - h_n(x^\bullet), \qquad x \in \mathsf{X} \tag{B.9}$$


The value iteration algorithm is obtained by disregarding the constant $h_n(x^\bullet)$.


Value Iteration Algorithm (VIA)


Initialized with a function $V_0\colon \mathsf{X} \to \mathbb{R}$. Then, for each $n \ge 0$,
$$V_{n+1}(x) = \min_u \{c(x,u) + P_u V_n(x)\} \tag{B.10}$$
A policy at stage $n$ is defined as the minimizer:


$$\phi_n(x) \in \arg\min_u \{c(x,u) + P_u V_n(x)\}$$


VIA solves a finite-horizon optimal control problem: 


Proposition B.1. At stage $n$, we have a sequence of policies $(\phi_0, \dots, \phi_{n-1})$. The function $V_n$
can be expressed as


$$V_n(x) = \min \mathsf{E}_x\Bigl[\sum_{k=0}^{n-1} c(X_k, U_k) + V_0(X_n)\Bigr] \tag{B.11}$$


where the minimum is over all admissible inputs. There is a minimizer that is Markov, but not
necessarily stationary:
$$U_k = \phi_{n-k-1}(X_k), \qquad 0 \le k \le n-1 \tag{B.12}$$


□


Suppose that $\{V_n\}$ are obtained using VIA, and $\{h_n\}$ are obtained using successive approximation,
with common initialization $h_0 = V_0$. It can be shown by induction on $n$ that


$$h_n(x) - h_n(x^\bullet) = V_n(x) - V_n(x^\bullet) \qquad \text{for each } x \text{ and each } n \ge 0.$$
Consequently, we have convergence under mild assumptions [45]:


$$\lim_{n\to\infty} [V_n(x) - V_n(x^\bullet)] = h^*(x) \tag{B.13}$$


It is argued in [126] that the rate of convergence can be improved by introducing additional “control
loops” in the VIA recursion (B.10).
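
The VIA recursion (B.10) and the convergence of the differences $V_n(x) - V_n(x^\bullet)$ can be observed on a small example. The following is a minimal sketch, assuming NumPy; the three-state, two-action MDP (arrays `P` and `c`), the choice of distinguished state `x_ref`, and the iteration count are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical MDP: P[u, x, y] = P_u(x, y), c[x, u] = cost of action u in state x.
P = np.array([[[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]],
              [[0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.3, 0.2]]])
c = np.array([[1.0, 2.0], [3.0, 0.5], [2.0, 4.0]])
x_ref = 0                                   # plays the role of the distinguished state

def bellman_backup(V):
    """Return min_u {c(x,u) + P_u V (x)} and a minimizing policy."""
    Q = c + np.einsum('uxy,y->xu', P, V)    # Q[x, u] = c(x,u) + sum_y P_u(x,y) V(y)
    return Q.min(axis=1), Q.argmin(axis=1)

V = np.zeros(3)                             # V_0
for n in range(500):
    V_next, policy = bellman_backup(V)      # recursion (B.10)
    eta_est = V_next[x_ref] - V[x_ref]      # growth per step approximates the average cost
    V = V_next

h_rel = V - V[x_ref]                        # V_n(x) - V_n(x_ref)
backup, _ = bellman_backup(h_rel)
acoe_residual = np.max(np.abs(backup - (h_rel + backup[x_ref])))
```

The array `h_rel` approximately solves the ACOE (B.6) with the normalization that it vanishes at `x_ref`, and `eta_est` approximates the optimal average cost. The values $V_n$ themselves grow linearly in $n$, which is one reason the normalized recursion (B.9) is preferred in long runs.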


B.2.2 Policy Improvement and Newton-Raphson 


The second well known algorithm is far more complex per iteration, but often converges very 
quickly. 

Policy Improvement Algorithm (PIA)


Given an initial policy $\phi_0$, a sequence $(\phi_n, h_n)$ is constructed as follows: At stage $n$, given $\phi_n$,


(i) Solve Poisson’s equation
$$P_n h_n = h_n - c_n + \eta_n$$


where $c_n(x) = c(x, \phi_n(x))$ for each $x$, $\eta_n$ is the steady-state cost using the policy $\phi_n$, and $P_n$
is the transition matrix obtained when the chain is controlled using $\phi_n$.


(ii) Construct a new policy:


$$\phi_{n+1}(x) \in \arg\min_u \{c(x,u) + P_u h_n(x)\}, \qquad x \in \mathsf{X} \tag{B.14}$$
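
The two steps of the PIA can be sketched as follows, assuming NumPy. The MDP data (`P`, `c`), the distinguished state `x_ref`, and the helper names `evaluate` and `improve` are hypothetical; Poisson's equation is solved as a linear system with the normalization $h_n(x^\bullet) = \eta_n$, which is an assumption adopted here to pin down the additive constant.

```python
import numpy as np

# Hypothetical MDP: P[u, x, y] = P_u(x, y), c[x, u] = cost of action u in state x.
P = np.array([[[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]],
              [[0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.3, 0.2]]])
c = np.array([[1.0, 2.0], [3.0, 0.5], [2.0, 4.0]])
n_states, x_ref = 3, 0
idx = np.arange(n_states)

def evaluate(policy):
    """Step (i): solve P_n h = h - c_n + eta for (h, eta), with h(x_ref) = eta."""
    P_n = P[policy, idx, :]                 # row x is P_{phi(x)}(x, .)
    c_n = c[idx, policy]
    M = np.zeros((n_states + 1, n_states + 1))
    M[:n_states, :n_states] = np.eye(n_states) - P_n   # (I - P_n) h
    M[:n_states, n_states] = 1.0                       # + eta * 1 = c_n
    M[n_states, x_ref] = 1.0                           # h(x_ref) - eta = 0
    M[n_states, n_states] = -1.0
    sol = np.linalg.solve(M, np.append(c_n, 0.0))
    return sol[:n_states], sol[n_states]

def improve(h):
    """Step (ii): phi_{n+1}(x) in argmin_u {c(x,u) + P_u h (x)}."""
    Q = c + np.einsum('uxy,y->xu', P, h)
    return Q.argmin(axis=1)

policy = np.zeros(n_states, dtype=int)      # phi_0: always action 0
eta_history = []
for _ in range(20):
    h, eta = evaluate(policy)
    eta_history.append(eta)
    new_policy = improve(h)
    if np.array_equal(new_policy, policy):  # no change: ACOE holds
        break
    policy = new_policy
```

In this sketch the sequence `eta_history` is non-increasing, and on termination the pair `(h, eta)` satisfies (B.7). The linear solve relies on each policy producing a chain with a unique invariant pmf, which holds here because all transition probabilities are positive.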


The PIA is in fact a special case of the Newton-Raphson method, in which the function $T$
appearing in (B.8) is replaced by its linearization $T_n$ to obtain a sequence of approximations to $h^*$.
Given some function $h_n$, the mapping $T_n$ is defined via a first order Taylor expansion,


$$T_n(h) = T(h_n) + D_n(h - h_n) \tag{B.15}$$


where $D_n = \nabla T(h_n)$ is a $d \times d$ matrix when $\mathsf{X}$ consists of $d$ elements. The Newton-Raphson method
then defines $h_{n+1}$ to be a solution of the linear equation,


$$h = T_n(h) = T(h_n) + D_n(h - h_n) \tag{B.16}$$


This is illustrated in Fig. B.1.


Figure B.1: PIA interpreted as an application of Newton-Raphson. The functional $T$ is piecewise linear for a
finite-state / finite-action MDP.


The functional $T$ may not be differentiable, but we can always find a sub-gradient $D_n$. This
means that for any $h_n$ and any other function $g$,


$$T(h_n + g) \le T(h_n) + D_n(g)$$


where the inequality is interpreted point-wise [remember, $T(h)$ and $D(g)$ are functions on $\mathsf{X}$]. The
existence of sub-gradients is assured because $T$ is a concave function of its arguments: for any two
functions $g$, $h$,


$$T(\alpha h + (1-\alpha) g) \ge \alpha T(h) + (1-\alpha) T(g), \qquad 0 \le \alpha \le 1.$$


The function $h - T(h)$ is convex, as illustrated in Fig. B.1.
The following lemma shows how to obtain the gradient of $T(h)$.
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Lemma B.2. For any function h: X > R, let db, denote a policy satisfying, 


+(x) € arg min{c(z,u) + Pyh(z)}, rex 


Let Py = Pp, and let S, denote the substitution operator, defined for any function g: X + R by 
S.g|,=9(0"), LEX 


Then, a subgradient of T at h is given by D, = Py — S,. 


Proof. To prove the lemma we must establish the following pointwise bound for any function $g$:

$$T(h + g) \le T(h) + P_+ g - g(x^\bullet)$$

It will then follow that $D_h(g) = P_+ g - g(x^\bullet)$.

Denote the evaluation of $T(h+g)$ at $x$ by $T(h+g)\big|_x$. The notation $Ph(x)$ and $Ph\big|_x$ both denote the evaluation of $Ph$ at $x$. Then, on setting $c_+(x) = c(x, \phi_+(x))$,

$$\begin{aligned}
T(h+g)\big|_x &= \min_u \{ c(x,u) + P_u(h+g)\big|_x \} - h(x^\bullet) - g(x^\bullet) \\
&\le c_+(x) + P_+(h+g)(x) - h(x^\bullet) - g(x^\bullet) \\
&= \{ c_+(x) + P_+ h(x) - h(x^\bullet) \} - g(x^\bullet) + P_+ g(x) \\
&= T(h)\big|_x - g(x^\bullet) + P_+ g(x)
\end{aligned} \tag{B.17}$$

which is the desired bound [the inequality is obtained on replacing the minimum over $u$ with the particular value $u = \phi_+(x)$]. □
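The bound in the lemma is easy to test numerically. The following sketch draws random perturbations $g$ and records the largest pointwise violation of $T(h+g) \le T(h) + P_+ g - g(x^\bullet)$. It assumes $T(h)|_x = \min_u\{c(x,u) + P_u h(x)\} - h(x^\bullet)$, as used in the proof, with $x^\bullet = 0$; the model (`P`, `c`) and all function names are hypothetical.

```python
import random


def T(h, P, c, x_ref=0):
    """Evaluate T(h) at every state: min_u {c(x,u) + P_u h(x)} - h(x_ref)."""
    d, nu = len(h), len(P)
    return [min(c[x][u] + sum(P[u][x][y] * h[y] for y in range(d)) for u in range(nu))
            - h[x_ref] for x in range(d)]


def greedy_policy(h, P, c):
    """A minimizing policy phi_+ for the function h."""
    d, nu = len(h), len(P)

    def q(x, u):
        return c[x][u] + sum(P[u][x][y] * h[y] for y in range(d))

    return [min(range(nu), key=lambda u: q(x, u)) for x in range(d)]


def D(h, g, P, c, x_ref=0):
    """Apply the subgradient D_h = P_+ - S_bullet to the function g."""
    phi = greedy_policy(h, P, c)
    d = len(h)
    return [sum(P[phi[x]][x][y] * g[y] for y in range(d)) - g[x_ref] for x in range(d)]


# Hypothetical model: P[u][x][y] and c[x][u].
P = [[[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
     [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.7, 0.1, 0.2]]]
c = [[0.0, 0.5], [1.0, 1.5], [3.0, 3.5]]

random.seed(0)
h = [random.uniform(-5, 5) for _ in range(3)]
Th = T(h, P, c)
worst = float("-inf")
for _ in range(1000):
    g = [random.uniform(-5, 5) for _ in range(3)]
    lhs = T([h[i] + g[i] for i in range(3)], P, c)
    Dg = D(h, g, P, c)
    worst = max(worst, max(lhs[x] - Th[x] - Dg[x] for x in range(3)))

print("largest value of T(h+g) - T(h) - D_h(g):", worst)
```

A value of `worst` that is non-positive (up to rounding) is what the lemma predicts.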


Given the function $h_n$, denote $\phi_{n+1}(x) \in \arg\min_u \{ c(x,u) + P_u h_n(x) \}$, and denote the resulting transition law and cost function as follows:

$$P_{n+1} = P_{\phi_{n+1}}, \qquad c_{n+1}(x) = c(x, \phi_{n+1}(x))$$

That is, $\phi_{n+1}$ is the feedback law $\phi_+$ given in the lemma with $h = h_n$, and $P_{n+1}$ is precisely $P_+$. Letting $g = h_{n+1} - h_n$, the lemma provides the following representation of the Newton-Raphson update (B.16):

$$\begin{aligned}
h_{n+1}(x) &= T(h_n)\big|_x + P_{n+1}(h_{n+1} - h_n)(x) - h_{n+1}(x^\bullet) + h_n(x^\bullet) \\
&= \{ c_{n+1}(x) + P_{n+1} h_n(x) - h_n(x^\bullet) \} \\
&\qquad + P_{n+1}(h_{n+1} - h_n)(x) - h_{n+1}(x^\bullet) + h_n(x^\bullet)
\end{aligned} \tag{B.18}$$

Canceling terms, we conclude that $h_{n+1}$ satisfies the following fixed point equation,

$$P_{n+1} h_{n+1}(x) = h_{n+1}(x) - c_{n+1}(x) + h_{n+1}(x^\bullet) \tag{B.19}$$

On writing $\eta_{n+1} = h_{n+1}(x^\bullet)$, the identity (B.19) becomes Poisson's equation:

$$P_{n+1} h_{n+1} = h_{n+1} - c_{n+1} + \eta_{n+1}$$

This is precisely the PIA.

B.2.3 LP Formulations


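The basic idea of the LP approach is as follows. Let $\varpi$ denote a pmf on state-action pairs $(x,u) \in \mathsf{X} \times \mathsf{U}$. We denote by $G$ the set of all possible limits of the empirical pmf:

$$\varpi_N(x,u) = \frac{1}{N} \sum_{k=0}^{N-1} \mathbb{1}\{X_k = x,\ U_k = u\}, \qquad x \in \mathsf{X},\ u \in \mathsf{U}.$$

That is, $\varpi \in G$ if there is an admissible input $\boldsymbol{U}$ and a subsequence $\{N_i\}$ such that,

$$\lim_{i \to \infty} \varpi_{N_i}(x,u) = \varpi(x,u) \quad \text{all } x, u \tag{B.20}$$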
It is known that $G$ is a polyhedron for which a simple characterization is easily obtained:

1. Any $\varpi \in G$ is a pmf on $\mathsf{X} \times \mathsf{U}$.
A factorization is obtained via Bayes' rule:

$$\varpi(x,u) = \pi(x)\,\phi(u \mid x) \tag{B.21}$$

where $\pi(x) = \sum_u \varpi(x,u)$ and $\phi(u \mid x) = \varpi(x,u)/\pi(x)$ for all $x$, $u$.

Let $P_\phi$ denote the transition matrix

$$P_\phi(x,x') = \sum_u \phi(u \mid x)\, P_u(x,x').$$

2. $\pi$ is an invariant pmf for $P_\phi$. The main step in the proof is the following consequence of (B.20):

$$\lim_{i \to \infty} \varpi_{N_i+1}(x,u) = \varpi(x,u) \quad \text{all } x, u$$

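We arrive at a "DPLP" for the ACOE:

ACOE Linear Program

$$\begin{aligned}
\eta^* = \min\ & \sum_{x,u} \varpi(x,u)\, c(x,u) && \text{(B.22a)} \\
\text{s.t.}\ & \sum_{x,u} \varpi(x,u)\, P_u(x,x') = \sum_u \varpi(x',u), \quad x' \in \mathsf{X} && \text{(B.22b)} \\
& \sum_{x,u} \varpi(x,u) = 1, \quad \varpi \ge 0 && \text{(B.22c)}
\end{aligned}$$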


Justification is provided in the following. See [64] or [254, Section 9.2] for details. 


Proposition B.3. The set $G$ is the convex set characterized by (B.22b, B.22c). The optimizer $\varpi^*$ defines an optimal policy $\phi^*$ via the factorization (B.21). □
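To make the constraints concrete, the sketch below builds the occupation pmf $\varpi(x,u) = \pi(x)\phi(u \mid x)$ of a randomized stationary policy and checks that it is feasible for (B.22b) and (B.22c). It is an illustration only: the model (`P`, `c`) and the policy `phi` are hypothetical, and the invariant pmf is approximated by power iteration, which is adequate here because every transition probability is assumed strictly positive.

```python
def invariant_pmf(P_phi, iters=500):
    """Approximate the invariant pmf of P_phi by repeated multiplication pi <- pi P_phi."""
    d = len(P_phi)
    pi = [1.0 / d] * d
    for _ in range(iters):
        pi = [sum(pi[x] * P_phi[x][x2] for x in range(d)) for x2 in range(d)]
    return pi


# Hypothetical model: P[u][x][y] and c[x][u]; phi[x][u] is a randomized policy.
P = [[[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
     [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.7, 0.1, 0.2]]]
c = [[0.0, 0.5], [1.0, 1.5], [3.0, 3.5]]
phi = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
d, nu = 3, 2

# Transition matrix P_phi(x, x') = sum_u phi(u | x) P_u(x, x').
P_phi = [[sum(phi[x][u] * P[u][x][x2] for u in range(nu)) for x2 in range(d)]
         for x in range(d)]
pi = invariant_pmf(P_phi)

# Occupation pmf from the factorization (B.21).
varpi = [[pi[x] * phi[x][u] for u in range(nu)] for x in range(d)]

# Residual of the balance constraint (B.22b) for each x'.
balance = [sum(varpi[x][u] * P[u][x][x2] for x in range(d) for u in range(nu))
           - sum(varpi[x2]) for x2 in range(d)]
total = sum(sum(row) for row in varpi)                      # constraint (B.22c)
avg_cost = sum(varpi[x][u] * c[x][u] for x in range(d) for u in range(nu))

print("balance residuals:", balance)
print("total mass:", total, "objective (B.22a):", avg_cost)
```

Minimizing the objective over all feasible $\varpi$ would require an LP solver, which is outside the scope of this sketch.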


The LP approach extends readily to other settings, for example multi-objective optimal control. Write $\langle \varpi, c \rangle = \sum_{x,u} \varpi(x,u)\, c(x,u)$. Given a collection of functions $\{c^i\}$ and bounds $\{\eta^i\}$ we can consider,

$$\min\ \langle \varpi, c \rangle \quad \text{s.t.} \quad \varpi \in G, \quad \langle \varpi, c^i \rangle \le \eta^i \ \text{ each } i$$

For those of you who know something about linear programs

First, the extreme points of this LP admit the factorization (B.21) with $\phi$ deterministic ($\phi(u \mid x)$ is zero or one for each $x$, $u$) and $\pi$ "ergodic" (the chain restricted to the support of $\pi$ is irreducible).

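There is something called the dual of an LP:

$$\min\ c^\top x \ \ \text{s.t.}\ \ Ax \ge b,\ x \ge 0 \qquad \Longleftrightarrow \qquad \max\ b^\top \xi \ \ \text{s.t.}\ \ A^\top \xi \le c,\ \xi \ge 0$$

The dual of (B.22) can be reduced to a version of the ACOE:

$$\begin{aligned}
\max_{z,\,h}\ & z \\
\text{s.t.}\ & c(x,u) - z + \sum_{y \in \mathsf{X}} P_u(x,y)\, h(y) - h(x) \ge 0, \qquad x \in \mathsf{X},\ u \in \mathsf{U}.
\end{aligned} \tag{B.23}$$

This looks very much like the DPLP (3.36), and its dual (5.82) resembles (B.22).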


Pre-publication draft -- March 25, 2022 


Appendix C 


Partial Observations and Belief States 


We now have an observation process $\boldsymbol{Y}$, and no direct measurements of the state process $\boldsymbol{X}$. This short survey concerns a model in which $\boldsymbol{X}$, $\boldsymbol{Y}$, and the input process $\boldsymbol{U}$ take values in finite sets, denoted $\mathsf{X}$, $\mathsf{Y}$, and $\mathsf{U}$.


We continue to write $\Phi_n = (X_n, U_n)$ when convenient.


C.1 POMDP Model 


The state process is a controlled Markov chain as before. The observation $Y_n$ is assumed to be a noisy, memoryless measurement of the state $X_n$, in the following sense: a family of pmfs on $\mathsf{Y}$ is given, denoted $\{q(\,\cdot \mid x) : x \in \mathsf{X}\}$. For each $y \in \mathsf{Y}$ and $x \in \mathsf{X}$,

$$\mathsf{P}\{Y_n = y \mid \Phi_k,\ k < n\} = q(y \mid x) \quad \text{on the event } X_n = x. \tag{C.1}$$


Our goal remains the same: We wish to optimize some performance criterion. This may be discounted cost, finite-horizon, or the average cost which has been the focus of the appendices up to now:

$$\eta = \limsup_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} c(\Phi_k)$$


The optimization problem becomes more complex when only Y is available for choosing U. 


It is rarely true that the optimal input is of the form $U_n = \phi(Y_n)$. We need the full history in general: for a sequence of functions $\{\phi_n\}$,

$$U_n = \phi_n(Y_0, \dots, Y_n) \tag{C.2}$$


This sounds horrible, but we will soon discover beautiful structure. 


The combined dynamics can be realized through coupled system equations:

$$\begin{aligned}
X_{n+1} &= f(X_n, U_n, N_{n+1}) \\
Y_{n+1} &= g(X_{n+1}, W_{n+1}), \qquad n \ge 0
\end{aligned}$$

where $(\boldsymbol{N}, \boldsymbol{W})$ is i.i.d., with $\boldsymbol{N}$ and $\boldsymbol{W}$ mutually independent. This gives

$$p(x' \mid x, u) = P_u(x,x') = \mathsf{P}\{f(x,u,N_1) = x'\} \quad \text{and} \quad q(y' \mid x') = \mathsf{P}\{g(x', W_1) = y'\}$$

It is sometimes useful to view the pair $(\boldsymbol{X}, \boldsymbol{Y})$ as the state process for an MDP model:

$$\begin{aligned}
\mathsf{P}\{X_{n+1} = x',\ Y_{n+1} = y' \mid X_n = x,\ Y_n = y,\ U_n = u\}
&= \mathsf{P}\{f(x,u,N_{n+1}) = x' \text{ and } g(x', W_{n+1}) = y'\} \\
&= p(x' \mid x, u)\, q(y' \mid x')
\end{aligned} \tag{C.4}$$

This gives the controlled transition matrix for the joint state-observation process:

$$T_u(z, z') = p(x' \mid x, u)\, q(y' \mid x'), \quad \text{with } z = (x,y) \text{ and } z' = (x', y'). \tag{C.5}$$
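The product form of the joint transition matrix can be tabulated directly. The sketch below does this for a single input value in a hypothetical two-state, two-observation model; the names `joint_kernel`, `P0` and `q` are illustrative. It also confirms two structural facts that follow from the formula: each row is a pmf, and the row for $z = (x,y)$ does not depend on $y$.

```python
def joint_kernel(P_u, q):
    """Tabulate T_u(z, z') = p(x' | x, u) q(y' | x') for z = (x, y), z' = (x', y')."""
    d, m = len(P_u), len(q[0])
    states = [(x, y) for x in range(d) for y in range(m)]
    T_u = {z: {z2: P_u[z[0]][z2[0]] * q[z2[0]][z2[1]] for z2 in states}
           for z in states}
    return states, T_u


# Hypothetical model: P0[x][x'] for one fixed input, q[x][y] = q(y | x).
P0 = [[0.9, 0.1], [0.2, 0.8]]
q = [[0.8, 0.2], [0.3, 0.7]]

states, T0 = joint_kernel(P0, q)
row_sums = [sum(T0[z].values()) for z in states]

print("states:", states)
print("row sums:", row_sums)
print("row of z=(0,0) equals row of z=(0,1):", T0[(0, 0)] == T0[(0, 1)])
```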


Given observations of $(\boldsymbol{X}, \boldsymbol{Y})$, we can apply our machinery to compute or approximate an optimal policy $U_n = \phi^*(X_n, Y_n)$.

However, the PO in POMDP stands for partial observations. This means that inputs are restricted to the form (C.2). We need to develop new tools to respect the limited information for control that is captured in (C.2).


C.2 A Fully Observed MDP 


Belief State. The partially observed MDP can be recast as fully observed, provided we 
change our definition of ‘state’. 


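The new state process $\{b_n : n \ge 0\}$ is also called the belief state, and coincides with the conditional distribution of the state given the observations: for each $x \in \mathsf{X}$,

$$b_n(x) = \mathsf{P}\{X_n = x \mid \mathcal{Y}_n\}, \quad \text{in which } \mathcal{Y}_n = \sigma\{Y_k : k \le n\} \tag{C.6}$$

The belief state evolves on what is called the simplex of pmfs on $\mathsf{X}$, denoted $\mathcal{S}$. If $\mathsf{X}$ consists of $d$ states, then

$$\mathcal{S} = \Bigl\{ b \in \mathbb{R}^d_+ : \sum_i b(i) = 1 \Bigr\}$$

The definition implies that for any function $h$,

$$\mathsf{E}[h(X_n) \mid \mathcal{Y}_n] = \sum_{x \in \mathsf{X}} b_n(x)\, h(x)$$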


In this short survey we restrict to the finite-horizon optimal control problem to explain the MDP construction. The extension to average-cost and discounted-cost will be obvious.

For fixed $N \ge 1$ and $0 \le n < N$, assume that some input has been applied for $k < n$, and denote the cost to go function,

$$V^*_{n,N} = \min_{U_n^{N-1}} \mathsf{E}\Bigl[ \sum_{k=n}^{N-1} c(\Phi_k) + V_0(X_N) \,\Big|\, \mathcal{Y}_n \Bigr] \tag{C.7}$$

This is random and grows in complexity with increasing $n$, since $V^*_{n,N}$ is a function of $Y_0, \dots, Y_n$. In order to mimic the characterization of optimal policies from the fully observed setting, we require a sufficient statistic $\boldsymbol{I} = \{I_0, \dots, I_n, \dots\}$ evolving on some fixed space, such that for some function $\mathcal{V}_{n,N}$,

$$V^*_{n,N} = \mathcal{V}_{n,N}(I_n)$$

Since this is a time-homogeneous model, we might hope to have $V^*_{n,N} = \mathcal{V}^*_{N-n}(I_n)$ for a sequence of functions $\{\mathcal{V}^*_m : m \ge 0\}$.

The amazing conclusion is this: a sufficient statistic does exist, with $I_n = b_n$. For this reason, the belief state $b_n$ is sometimes called the information state. Moreover, the stochastic process $\boldsymbol{b} = \{b_0, b_1, \dots\}$ is itself a controlled Markov model that is fully observed. The Markov property is explained in Prop. C.1. A deterministic stationary Markov policy is defined via a feedback law $U_n = \phi(b_n)$, where $\phi\colon \mathcal{S} \to \mathsf{U}$.

Prop. C.1 states that there is a mapping $M\colon \mathcal{S} \times \mathsf{Y} \times \mathsf{U} \to \mathcal{S}$ such that for each $n \ge 0$,

$$b_{n+1} = M(b_n, Y_{n+1}, U_n), \qquad n \ge 0 \tag{C.8}$$

From this, a specific formula for the controlled transition kernel for $\{b_n\}$ is obtained in (C.9). The proof is postponed to Appendix C.3.


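Proposition C.1. (Transition law for the belief state) The following Markov properties hold: for any admissible input $\boldsymbol{U}$, any set $S \subset \mathcal{S}$, and any $y' \in \mathsf{Y}$,

$$\begin{aligned}
\mathsf{P}\{b_{n+1} \in S,\ Y_{n+1} = y' \mid \mathcal{Y}_n\} &= \mathsf{P}\{b_{n+1} \in S,\ Y_{n+1} = y' \mid b_n, U_n\} \\
\mathsf{P}\{b_{n+1} \in S \mid \mathcal{Y}_n\} &= \mathsf{P}\{b_{n+1} \in S \mid b_n, U_n\}
\end{aligned}$$

The transition kernel for the belief state is given as follows: for any $b \in \mathcal{S}$, $u \in \mathsf{U}$, and any $S \subset \mathcal{S}$:

$$\begin{aligned}
\mathsf{P}\{b_{n+1} \in S \mid b_n = b,\ U_n = u\}
&= \sum_{y'} \mathbb{1}\{M(b,y',u) \in S\}\, \mathsf{P}\{Y_{n+1} = y' \mid b_n = b,\ U_n = u\} \\
&= \sum_{y'} \mathbb{1}\{M(b,y',u) \in S\} \Bigl( \sum_{x,x' \in \mathsf{X}} b(x)\, P_u(x,x')\, q(y' \mid x') \Bigr)
\end{aligned} \tag{C.9}$$

Moreover, for any function $F\colon \mathcal{S} \times \mathsf{Y} \to \mathbb{R}$,

$$\mathsf{E}[F(b_{n+1}, Y_{n+1}) \mid \mathcal{Y}_n;\ b_n = b,\ U_n = u]
= \sum_{y'} \sum_{x,x'} b(x)\, P_u(x,x')\, q(y' \mid x')\, F(M(b,y',u), y')$$

The representation of the cost to go is given in the following: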


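Proposition C.2. For any $b \in \mathcal{S}$ and $u \in \mathsf{U}$ denote,

$$C(b,u) \overset{\text{def}}{=} \sum_x b(x)\, c(x,u), \qquad \mathcal{V}_0(b) \overset{\text{def}}{=} \sum_x b(x)\, V_0(x) \tag{C.10}$$

Then, the cost to go admits the representation

$$V^*_{n,N} = \mathcal{V}^*_{N-n}(b_n) \overset{\text{def}}{=} \min_{U_n^{N-1}} \mathsf{E}\Bigl[ \sum_{k=n}^{N-1} C(b_k, U_k) + \mathcal{V}_0(b_N) \,\Big|\, b_n \Bigr] \tag{C.11}$$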

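Proof. The cost function $c\colon \mathsf{X} \times \mathsf{U} \to \mathbb{R}$ and the terminal cost $V_0$ are replaced with functions of the belief state by applying the smoothing property of conditional expectation: for any admissible input $\boldsymbol{U}$ and $k, N \ge n$,

$$\begin{aligned}
\mathsf{E}[c(X_k, U_k) \mid \mathcal{Y}_n] &= \mathsf{E}[C(b_k, U_k) \mid \mathcal{Y}_n], &
C(b_k, U_k) &= \sum_x b_k(x)\, c(x, U_k) = \mathsf{E}[c(X_k, U_k) \mid \mathcal{Y}_k] \\
\mathsf{E}[V_0(X_N) \mid \mathcal{Y}_n] &= \mathsf{E}[\mathcal{V}_0(b_N) \mid \mathcal{Y}_n], &
\mathcal{V}_0(b_N) &= \sum_x b_N(x)\, V_0(x) = \mathsf{E}[V_0(X_N) \mid \mathcal{Y}_N]
\end{aligned}$$

Consequently, the (possibly non-optimal) cost to go admits the representation

$$\begin{aligned}
V_{n,N} &= \mathsf{E}\Bigl[ \sum_{k=n}^{N-1} c(X_k, U_k) + V_0(X_N) \,\Big|\, \mathcal{Y}_n \Bigr] \\
&= \mathsf{E}\Bigl[ \sum_{k=n}^{N-1} C(b_k, U_k) + \mathcal{V}_0(b_N) \,\Big|\, \mathcal{Y}_n \Bigr] \\
&= \mathsf{E}\Bigl[ \sum_{k=n}^{N-1} C(b_k, U_k) + \mathcal{V}_0(b_N) \,\Big|\, b_n, U_n \Bigr]
\end{aligned}$$

where the final equality follows from Prop. C.1. On minimizing over admissible inputs we obtain the representation (C.11). □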


Besides the remaining work involved in establishing Prop. C.1, the biggest open question is can we ever solve a POMDP? A solution can be found for the special case of linear Gaussian models because in this case $b_n$ is Gaussian, and hence characterized by the conditional mean and covariance.

We could potentially apply value iteration to obtain the optimal policy. Given an initial value function $\mathcal{V}_0\colon \mathcal{S} \to \mathbb{R}$, we define by induction,

$$\mathcal{V}_n(b) = \min_u \Bigl\{ C(b,u) + \sum_{y'} \sum_{x,x'} b(x)\, P_u(x,x')\, q(y' \mid x')\, \mathcal{V}_{n-1}(M(b,y',u)) \Bigr\}, \qquad b \in \mathcal{S}.$$

Or, if you are interested in the average-cost optimization problem you would solve the dynamic programming equation:

$$\eta^* + H^*(b) = \min_u \Bigl\{ C(b,u) + \sum_{y'} \sum_{x,x'} b(x)\, P_u(x,x')\, q(y' \mid x')\, H^*(M(b,y',u)) \Bigr\}$$

where $H^*\colon \mathcal{S} \to \mathbb{R}$ and $\eta^*$ is the optimal average cost for the partially-observed optimal control problem.

While it is unfortunate that we have to move from the finite state space $\mathsf{X}$ to the simplex of pmfs $\mathcal{S}$, we are fortunate that the value functions have simple structure: It can be shown by induction that $\mathcal{V}_n$ is concave and piecewise linear as a function of $b \in \mathcal{S}$ [326, 201]. You would rightly conclude that $H^*$ is also concave as a function of $b$. This opens the door to simple approximation architectures for approximate dynamic programming or reinforcement learning.
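For a short horizon the value iteration recursion on the belief simplex can be evaluated exactly by recursing over observations. The sketch below is illustrative only: the two-state model (`P`, `q`, `c`, `V0`) is hypothetical, the belief update is written as $M(b,y',u)(x') \propto q(y' \mid x') \sum_x b(x) P_u(x,x')$, and the function names are made up. The cost of the recursion grows exponentially with the horizon, so this is not a practical algorithm.

```python
def step(b, u, y, P, q):
    """Return (P{Y' = y | b, u}, M(b, y, u)); the belief is None if y has probability 0."""
    d = len(b)
    unnorm = [q[x2][y] * sum(b[x] * P[u][x][x2] for x in range(d)) for x2 in range(d)]
    py = sum(unnorm)
    return py, ([v / py for v in unnorm] if py > 0 else None)


def value(n, b, P, q, c, V0):
    """Evaluate the n-step value function at the belief b by direct recursion."""
    d, nu, ny = len(b), len(P), len(q[0])
    if n == 0:
        return sum(b[x] * V0[x] for x in range(d))
    best = float("inf")
    for u in range(nu):
        total = sum(b[x] * c[x][u] for x in range(d))        # C(b, u)
        for y in range(ny):
            py, b_next = step(b, u, y, P, q)
            if py > 0:
                total += py * value(n - 1, b_next, P, q, c, V0)
        best = min(best, total)
    return best


# Hypothetical model: P[u][x][x'], q[x][y] = q(y | x), c[x][u], terminal cost V0[x].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.6, 0.4]]]
q = [[0.8, 0.2], [0.3, 0.7]]
c = [[0.0, 1.0], [2.0, 0.5]]
V0 = [0.0, 1.0]

v_left = value(4, [1.0, 0.0], P, q, c, V0)
v_right = value(4, [0.0, 1.0], P, q, c, V0)
v_mid = value(4, [0.5, 0.5], P, q, c, V0)
print("V_4 at the vertices:", v_left, v_right, "and at the midpoint:", v_mid)
```

Concavity predicts that the midpoint value is at least the average of the two vertex values, which the printed numbers can be checked against.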


The next section contains the proof of Prop. C.1. 


C.3 Belief State Dynamics 


A more complete and detailed exposition of much of these notes can be found in Hajek's book [154] or Van Handel's lecture notes Hidden Markov Models [362]. However, Prop. C.1 giving the transition law for the belief state is more difficult to find (this is usually taken for granted).

Recall that $(\boldsymbol{X}, \boldsymbol{Y})$ is regarded as the state process for an MDP model with transition law given in (C.5). This notation is too complex for our purposes. Our interest is in constructing a recursive algorithm that generates the belief state (a function of $x$), so we suppress the role of observations. For each $n$, given the observed quantities $y_n = Y_n$ and $u_{n-1} = U_{n-1}$, denote (following the notation of (C.5)),

$$p_n(x_n \mid x_{n-1}) = p(x_n \mid x_{n-1}, u_{n-1})\, q(y_n \mid x_n) \tag{C.12}$$

where it is understood that $x_{n-1}$ and $x_n$ are variables. In contrast, $y_n$ and $u_{n-1}$ appearing in (C.12) are not variables: they are observed quantities.
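We first consider a (seemingly) more complex problem: Given a sequence $x_0^n \in \mathsf{X}^{n+1}$, we want to find the conditional probability that $X_0^n \overset{\text{def}}{=} (X_0, \dots, X_n) = x_0^n$, denoted

$$\beta_n(x_0^n) \overset{\text{def}}{=} \mathsf{P}\{X_0^n = x_0^n \mid \mathcal{Y}_n\}$$

This is known as a smoothing problem since we wish to estimate past states given past and present observations. The pmf $\beta_n$ can be expressed in terms of the observed input and output sequences $u_0^{n-1}$, $y_0^n$; this is obtained using Bayes rule:

$$\begin{aligned}
\beta_n(x_0^n) &= \mathsf{P}\{X_0^n = x_0^n \mid Y_0^n = y_0^n,\ U_0^{n-1} = u_0^{n-1}\} \\
&= \frac{\mathsf{P}\{X_0^n = x_0^n \text{ and } Y_0^n = y_0^n,\ U_0^{n-1} = u_0^{n-1}\}}{\mathsf{P}\{Y_0^n = y_0^n,\ U_0^{n-1} = u_0^{n-1}\}}
\end{aligned}$$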

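The denominator can be regarded as a normalizing constant (it does not depend on $x_0^n$). Letting $\beta^\circ_n$ denote the numerator gives,

$$\begin{aligned}
\beta^\circ_n(x_0^n) &= \mathsf{P}\{X_0^n = x_0^n \text{ and } Y_0^n = y_0^n,\ U_0^{n-1} = u_0^{n-1}\} \\
&= p_n(x_n \mid x_{n-1}) \times \cdots \times p_1(x_1 \mid x_0) \times \mu(x_0)\, q(y_0 \mid x_0)
\end{aligned}$$

where the pmf $\mu$ defines the distribution for $X_0$. The factor $q(y_0 \mid x_0)$ accounts for the initial observation, consistent with the initialization of the filter in Prop. C.3.

This is true for any $n$, so that we obtain a recursive relationship,

$$\beta^\circ_n(x_0^n) = p_n(x_n \mid x_{n-1})\, \beta^\circ_{n-1}(x_0^{n-1}) \tag{C.13}$$

We then obtain $\beta_n$ through normalization, since we know it is a pmf:

$$\beta_n(x_0^n) = \kappa_n\, \beta^\circ_n(x_0^n), \qquad \kappa_n^{-1} = \sum_{x_0, \dots, x_n} \beta^\circ_n(x_0, \dots, x_n)$$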
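What about the original problem? We again have Bayes rule:

$$\begin{aligned}
b_n(x_n) &= \mathsf{P}\{X_n = x_n \mid Y_0^n = y_0^n,\ U_0^{n-1} = u_0^{n-1}\} \\
&= \frac{\mathsf{P}\{X_n = x_n \text{ and } Y_0^n = y_0^n,\ U_0^{n-1} = u_0^{n-1}\}}{\mathsf{P}\{Y_0^n = y_0^n,\ U_0^{n-1} = u_0^{n-1}\}}
\end{aligned}$$

The numerator, denoted $b^\circ_n(x_n)$, is obtained from $\beta^\circ_n$ as follows:

$$\begin{aligned}
b^\circ_n(x_n) &= \mathsf{P}\{X_n = x_n \text{ and } Y_0^n = y_0^n,\ U_0^{n-1} = u_0^{n-1}\} \\
&= \sum_{x_0, \dots, x_{n-1}} \beta^\circ_n(x_0, \dots, x_{n-1}, x_n)
\end{aligned}$$

Applying (C.13) gives a recursive formula for the unnormalized belief state:

$$\begin{aligned}
b^\circ_n(x_n) &= \sum_{x_0, \dots, x_{n-1}} p_n(x_n \mid x_{n-1})\, \beta^\circ_{n-1}(x_0, \dots, x_{n-1}) \\
&= \sum_{x_{n-1}} p_n(x_n \mid x_{n-1}) \sum_{x_0, \dots, x_{n-2}} \beta^\circ_{n-1}(x_0, \dots, x_{n-1}) \\
&= \sum_{x_{n-1}} p_n(x_n \mid x_{n-1})\, b^\circ_{n-1}(x_{n-1})
\end{aligned}$$

From which we obtain the linear dynamics:

$$b^\circ_n(x') = \sum_x p_n(x' \mid x)\, b^\circ_{n-1}(x), \qquad n \ge 1,\ x' \in \mathsf{X}$$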
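The passage from the smoothing pmf to the linear recursion can be checked by brute force on a small example. The sketch below computes the unnormalized belief state in two ways: by enumerating every state path and summing the unnormalized smoothing weights over the past, and by the linear recursion. The model (`P`, `q`, `mu`), the observed sequences (`ys`, `us`) and the function names are hypothetical, and the initial weight is taken to be $\mu(x_0)\, q(y_0 \mid x_0)$, matching the initialization of the filter in Prop. C.3.

```python
from itertools import product


def p_n(x_next, x_prev, u_prev, y_next, P, q):
    """The kernel (C.12): p(x_n | x_{n-1}, u_{n-1}) q(y_n | x_n)."""
    return P[u_prev][x_prev][x_next] * q[x_next][y_next]


def beta_unnormalized(path, ys, us, mu, P, q):
    """Unnormalized smoothing weight of one state path x_0, ..., x_n."""
    val = mu[path[0]] * q[path[0]][ys[0]]
    for n in range(1, len(path)):
        val *= p_n(path[n], path[n - 1], us[n - 1], ys[n], P, q)
    return val


def marginal_by_enumeration(ys, us, mu, P, q):
    """Sum the path weights over x_0, ..., x_{n-1} for each final state x_n."""
    d = len(mu)
    out = [0.0] * d
    for path in product(range(d), repeat=len(ys)):
        out[path[-1]] += beta_unnormalized(path, ys, us, mu, P, q)
    return out


def marginal_by_recursion(ys, us, mu, P, q):
    """Linear recursion for the unnormalized belief state."""
    d = len(mu)
    b = [mu[x] * q[x][ys[0]] for x in range(d)]
    for n in range(1, len(ys)):
        b = [sum(p_n(x2, x, us[n - 1], ys[n], P, q) * b[x] for x in range(d))
             for x2 in range(d)]
    return b


# Hypothetical model and data.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.6, 0.4]]]
q = [[0.8, 0.2], [0.3, 0.7]]
mu = [0.5, 0.5]
ys = [0, 1, 1, 0]      # observations y_0, ..., y_3
us = [0, 1, 0]         # inputs u_0, ..., u_2

b_enum = marginal_by_enumeration(ys, us, mu, P, q)
b_rec = marginal_by_recursion(ys, us, mu, P, q)
gap = max(abs(a - b) for a, b in zip(b_enum, b_rec))
print("enumeration:", b_enum)
print("recursion:  ", b_rec)
```

The enumeration costs $d^{\,n+1}$ path evaluations while the recursion costs on the order of $n d^2$ operations, which is the practical reason for preferring the latter.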


From Prop. C.3 we obtain the transition matrix for the belief state process given in Prop. C.1.

Proposition C.3. (Nonlinear filter) The belief state dynamics are nonlinear:

$$\begin{aligned}
b_0(x') &= \kappa_0\, \mu(x')\, q(Y_0 \mid x') \\
b_n(x') &= M(b_{n-1}, Y_n, U_{n-1})\big|_{x'} = \kappa_n \sum_x p_n(x' \mid x)\, b_{n-1}(x), \qquad x' \in \mathsf{X},\ n \ge 1
\end{aligned}$$

where $p_n$ is defined in (C.12), and $\kappa_n$ is determined by the constraint $\sum_{x'} b_n(x') = 1$. □
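The filter in the proposition translates almost line by line into code. The following is a minimal sketch under stated assumptions: the two-state model (`P`, `q`, `mu`) and the observed sequences are hypothetical, and `initial_belief` and `belief_update` are illustrative names for the initialization and for the mapping $M$.

```python
def initial_belief(mu, y0, q):
    """b_0(x') = kappa_0 mu(x') q(y0 | x')."""
    unnorm = [mu[x] * q[x][y0] for x in range(len(mu))]
    total = sum(unnorm)
    return [v / total for v in unnorm]


def belief_update(b, y_next, u, P, q):
    """The map M(b, y', u): propagate through P_u, weight by q(y' | x'), normalize."""
    d = len(b)
    unnorm = [q[x2][y_next] * sum(b[x] * P[u][x][x2] for x in range(d))
              for x2 in range(d)]
    total = sum(unnorm)    # equals P{Y_{n+1} = y' | b_n = b, U_n = u}
    if total == 0:
        raise ValueError("observation has probability zero under this belief")
    return [v / total for v in unnorm]


# Hypothetical model: P[u][x][x'], q[x][y] = q(y | x), initial pmf mu.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.6, 0.4]]]
q = [[0.8, 0.2], [0.3, 0.7]]
mu = [0.5, 0.5]
ys = [0, 1, 1]         # observations y_0, y_1, y_2
us = [0, 1]            # inputs u_0, u_1

beliefs = [initial_belief(mu, ys[0], q)]
for n in range(1, len(ys)):
    beliefs.append(belief_update(beliefs[-1], ys[n], us[n - 1], P, q))

for n, b in enumerate(beliefs):
    print("b_%d =" % n, b)
```

The normalizing constant computed in `belief_update` is the observation probability appearing in the transition kernel (C.9), so the same function can be reused when simulating the belief state process.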